PREFACE


This book is an introduction to regression analysis, covering simple, multiple, 
and logistic regression analysis. 

Simple and multiple regression analysis are statistical methods for predicting 
values; for example, you can use simple regression analysis to predict the number 
of iced tea orders based on the day’s high temperature or use multiple regression 
analysis to predict monthly sales of a shop based on its size and distance from 
the nearest train station. 

Logistic regression analysis is a method for predicting probability, such as 
the probability of selling a particular cake based on a certain day of the week. 

The intended readers of this book are statistics and math students who’ve 
found it difficult to grasp regression analysis, or anyone wanting to get started 
with statistical predictions and probabilities. You’ll need some basic statistical 
knowledge before you start. The Manga Guide to Statistics (No Starch Press, 
2008) is an excellent primer to prepare you for the work in this book. 

This book consists of four chapters: 

• Chapter 1: A Refreshing Glass of Math
• Chapter 2: Simple Regression Analysis
• Chapter 3: Multiple Regression Analysis
• Chapter 4: Logistic Regression Analysis

Each chapter has a manga section and a slightly more technical text section. 
You can get a basic overview from the manga, and some more useful details and 
definitions from the text sections. 

I’d like to mention a few words about Chapter 1. Although many readers may 
have already learned the topics in this chapter, like differentiation and matrix 
operations, Chapter 1 reviews these topics in the context of regression analysis,
which will be useful for the lessons that follow. If Chapter 1 is merely a refresher 
for you, that’s great. If you’ve never studied those topics or it’s been a long time 
since you have, it’s worth putting in a bit of effort to make sure you understand 
Chapter 1 first. 

In this book, the math for the calculations is covered in detail. If you’re good 
at math, you should be able to follow along and make sense of the calculations. If 
you’re not so good at math, you can just get an overview of the procedure and use 
the step-by-step instructions to find the actual answers. You don’t need to force 
yourself to understand the math part right now. Keep yourself relaxed. However, 
do take a look at the procedure of the calculations. 



We’ve rounded some of the figures in this book to make them easier to 
read, which means that some of the values may be inconsistent with the values 
you will get by calculating them yourself, though not by much. We ask for your 
understanding. 

I would like to thank my publisher, Ohmsha, for giving me the opportunity 
to write this book. I would also like to thank TREND-PRO, Co., Ltd. for turning 
my manuscript into this manga, the scenario writer re_akino, and the illustrator
Iroha Inoue. Last but not least, I would like to thank Dr. Sakaori Fumitake of the
College of Social Relations, Rikkyo University. He provided me with invaluable
advice, much more than he had given me when I was preparing my previous book.
I’d like to express my deep appreciation.

Shin Takahashi 
September 2005 
I WONDER IF HE'LL COME IN AGAIN SOON...

WHAT'S THIS?
MULTIPLE LINEAR? LOTS OF LINES??

NOT QUITE.

WE USE LINEAR REGRESSION TO ESTIMATE THE NUMBER OF ICED TEA ORDERS BASED
ON ONE FACTOR: TEMPERATURE.

BUT IN MULTIPLE LINEAR REGRESSION ANALYSIS, WE USE SEVERAL FACTORS, LIKE
TEMPERATURE, PRICE OF ICED TEA, AND NUMBER OF STUDENTS TAKING CLASSES NEARBY.
LOOK AT AN EXAMPLE OF MULTIPLE LINEAR REGRESSION ANALYSIS.

MR. GUYMAN IS THE CEO OF A CHAIN STORE. IN ADDITION TO TRACKING SALES, HE
ALSO KEEPS THE FOLLOWING RECORDS FOR EACH OF HIS STORES:

• DISTANCE TO THE NEAREST COMPETING STORE
• NUMBER OF HOUSES WITHIN A MILE OF THE STORE
• ADVERTISING EXPENDITURE

[Table, one row per store, with columns: Store; Distance to nearest competing
store (m); Houses within a mile of the store; Advertising expenditure (yen);
Sales (yen)]



WHEN HE IS CONSIDERING OPENING A NEW SHOP...

...HE CAN ESTIMATE SALES AT THE NEW SHOP BASED ON HOW THE OTHER THREE
FACTORS RELATE TO SALES AT HIS EXISTING STORES.

I SHOULD TOTALLY OPEN A NEW STORE!

AMAZING!




THERE ARE OTHER METHODS OF ANALYSIS, TOO, LIKE LOGISTIC REGRESSION ANALYSIS.

IF I STUDY THIS BOOK...

THEN MAYBE... ONE DAY I CAN TALK TO HIM ABOUT IT.

MORE TEA?

CHAPTER 1: A REFRESHING GLASS OF MATH

SURE, LET'S DO IT. REGRESSION DEPENDS ON SOME MATH...

SO WE'LL START WITH THAT.

NOTATION RULES

[Figure: blackboard of notation rules]
$$y = 2x + 1$$

FIRST, I'LL EXPLAIN INVERSE FUNCTIONS USING THE LINEAR FUNCTION y = 2x + 1
AS AN EXAMPLE.

WHEN x IS ZERO, WHAT IS THE VALUE OF y?

$$y = 2 \times 0 + 1 = 0 + 1 = 1$$

IT'S 1.

HOW ABOUT WHEN x IS 3?

$$y = 2x + 1 = 2 \times 3 + 1 = 6 + 1 = 7$$

IT'S 7.

THE VALUE OF y DEPENDS ON THE VALUE OF x.

SO WE CALL y THE OUTCOME, OR DEPENDENT VARIABLE, AND x THE PREDICTOR, OR
INDEPENDENT VARIABLE.

YOU COULD SAY THAT x IS THE BOSS OF y.

I'M THIRSTY!

YOUR DRINK, SIR.
SO FOR THE EXAMPLE y = 2x + 1, THE INVERSE FUNCTION IS...

$$y = 2x + 1 \quad\longrightarrow\quad x = 2y + 1$$

...ONE IN WHICH y AND x HAVE SWITCHED SEATS.

WE REORGANIZE THE FUNCTION LIKE THIS.

$$x = 2y + 1 \quad\xrightarrow{\text{transpose}}\quad 2y = x - 1 \quad\longrightarrow\quad y = \frac{1}{2}x - \frac{1}{2}$$

YOU TRANSPOSED IT AND DIVIDED BY 2, SO NOW y IS ALONE.

THAT'S RIGHT. TO EXPLAIN WHY THIS IS USEFUL, LET'S DRAW A GRAPH.
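
The swap-and-solve idea can be checked numerically. The following is a minimal Python sketch, not part of the original lesson; the function names `f` and `f_inverse` are made up for illustration. It applies y = 2x + 1 and then undoes it by subtracting 1 and dividing by 2.

```python
def f(x):
    """The example function y = 2x + 1."""
    return 2 * x + 1


def f_inverse(y):
    """Undo f: transpose the 1 (subtract it), then divide by 2."""
    return (y - 1) / 2


# Feed each x through f, then through the inverse, and compare.
for x in (0, 3):
    y = f(x)
    print(f"x = {x} -> y = {y} -> back to x = {f_inverse(y)}")
```

Applying the inverse to the output of the function should always return the starting value, which is what "switching seats" means in practice.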
EXPONENTS AND LOGARITHMS

OKAY... ON TO THE NEXT LESSON. THESE ARE CALLED EXPONENTIAL FUNCTIONS.

THEY ALL CROSS THE POINT (0, 1) BECAUSE ANY NUMBER TO THE ZERO POWER IS 1.



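RULES FOR EXPONENTS AND LOGARITHMS

Let’s try this. We’ll confirm that $(e^a)^b$ and $e^{a \times b}$ are equal
when a = 2 and b = 3.

$$(e^2)^3 = e^2 \times e^2 \times e^2 = (e \times e) \times (e \times e) \times (e \times e) = e \times e \times e \times e \times e \times e = e^6 = e^{2 \times 3}$$

This also means $(e^a)^b = e^{a \times b} = (e^b)^a$.

Now let’s try this, too. We’ll confirm that $\frac{e^a}{e^b}$ and $e^{a-b}$
are equal when a = 3 and b = 5.

$$\frac{e^3}{e^5} = \frac{e \times e \times e}{e \times e \times e \times e \times e} = \frac{1}{e \times e} = \frac{1}{e^2} = e^{-2} = e^{3-5}$$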
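
Both confirmations can also be run numerically. This is a small Python sketch added for illustration (the variable names are not from the text); it uses `math.exp` for powers of e and `math.isclose` because floating-point results are only approximately equal.

```python
import math

# Power rule: (e^a)^b equals e^(a*b), checked with a = 2 and b = 3.
a, b = 2, 3
power_left = math.exp(a) ** b
power_right = math.exp(a * b)

# Quotient rule: e^a / e^b equals e^(a-b), checked with a = 3 and b = 5.
c, d = 3, 5
quotient_left = math.exp(c) / math.exp(d)
quotient_right = math.exp(c - d)

print("power rule holds:", math.isclose(power_left, power_right))
print("quotient rule holds:", math.isclose(quotient_left, quotient_right))
```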
As mentioned on page 20, $y = \log_e x$ and $x = e^y$ are equivalent. First we
need to look at what a logarithm is. An exponential function of base
b to a power, n, equals a value, x. The logarithm function inverts
this process. That means the logarithm base b of a value, x, equals
a power, n.

So $b^n = x$ also means $\log_b x = n$, where b is the base, x is the value,
and n is the power.

We see that in $\log_e(e^a) = n$, the base b is e and the value x is $e^a$,
so $e^n = e^a$ and $n = a$.

Let’s confirm that $\log_e(a^b)$ and $b \times \log_e(a)$ are equal. We’ll start by
using $b \times \log_e(a)$ and e in the Power Rule:

$$e^{b \times \log_e(a)} = \left(e^{\log_e(a)}\right)^b$$

And since raising e to a power is the inverse of $\log_e$, we can reduce
$e^{\log_e(a)}$ on the right side to just a:

$$e^{b \times \log_e(a)} = a^b$$

Now we’ll use the rule that $b^n = x$ also means $\log_b x = n$, where:

base = e
value = $a^b$
power = $b \times \log_e(a)$

This means that $e^{b \times \log_e(a)} = a^b$, so we can conclude that $\log_e(a^b)$ is
equal to $b \times \log_e(a)$.
Let’s confirm that $\log_e(a) + \log_e(b)$ and $\log_e(a \times b)$ are equal. Again,
we’ll use the rule that states that $b^n = x$ also means $\log_b x = n$.

Let’s start by defining $e^m = a$ and $e^n = b$. We would then have
$e^m e^n = e^{m+n} = a \times b$, thanks to the Product Rule of exponents. We can
then take the log of both sides,

$$\log_e(e^{m+n}) = \log_e(a \times b),$$

which on the left side reduces simply to

$$m + n = \log_e(a \times b).$$

We also know that $m + n = \log_e a + \log_e b$, so clearly

$$\log_e(a) + \log_e(b) = \log_e(a \times b).$$

HERE I HAVE SUMMARIZED THE RULES I'VE EXPLAINED SO FAR.

RULE 1: $(e^a)^b$ and $e^{ab}$ are equal.

RULE 2: $\frac{e^a}{e^b}$ and $e^{a-b}$ are equal.

RULE 3: a and $\log_e(e^a)$ are equal.

RULE 4: $\log_e(a^b)$ and $b \times \log_e(a)$ are equal.

RULE 5: $\log_e(a) + \log_e(b)$ and $\log_e(a \times b)$ are equal.

In fact, one could replace the number e in these equations with
any positive real number d other than 1. Can you prove these rules again
using d as the base?
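
A numerical check is not a proof, but it is a quick way to see the three logarithm rules hold for a base other than e. The sketch below is an illustration only; the helper name `check_log_rules` and the sample values are assumptions, and `math.log(value, base)` is the standard-library logarithm with an explicit base.

```python
import math


def check_log_rules(a, b, base=math.e):
    """Return whether Rules 3, 4, and 5 hold numerically for positive a and b."""
    def log(value):
        return math.log(value, base)

    rule3 = math.isclose(log(base ** a), a)               # log_d(d^a) = a
    rule4 = math.isclose(log(a ** b), b * log(a))         # log_d(a^b) = b * log_d(a)
    rule5 = math.isclose(log(a) + log(b), log(a * b))     # log_d(a) + log_d(b) = log_d(ab)
    return rule3, rule4, rule5


print(check_log_rules(2.0, 3.0))            # base e
print(check_log_rules(2.0, 3.0, base=10))   # base d = 10
```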


DIFFERENTIAL CALCULUS

IT LOOKS BAD, BUT IT'S NOT THAT HARD. I'LL EXPLAIN IT SO THAT YOU CAN
UNDERSTAND.

TRUST ME, YOU'LL DO FINE.

MAKE THIS DATA INTO A SCATTER PLOT.

OKAY, HOLD ON.

[Figure: scatter plot with HEIGHT on the vertical axis]
WHERE DID THAT $y = -\frac{326.6}{x} + 173.3$ FUNCTION COME FROM?!

THAT IS A REGRESSION EQUATION! DON'T WORRY ABOUT HOW TO GET IT RIGHT NOW.

JUST ASSUME IT DESCRIBES THE RELATIONSHIP BETWEEN YOUR AGE AND YOUR HEIGHT.

FOR NOW, I'LL JUST BELIEVE THAT THE RELATIONSHIP IS $y = -\frac{326.6}{x} + 173.3$.

NOW, CAN YOU SEE THAT "7 YEARS OLD" CAN BE DESCRIBED AS "(6 + 1) YEARS OLD"?

WELL YEAH, THAT MAKES SENSE.

SO USING THE EQUATION, YOUR INCREASE IN HEIGHT BETWEEN AGE 6 AND AGE (6 + 1)
CAN BE DESCRIBED AS...

$$\underbrace{\left(-\frac{326.6}{6+1} + 173.3\right)}_{\text{HEIGHT AT AGE } (6+1)} - \underbrace{\left(-\frac{326.6}{6} + 173.3\right)}_{\text{HEIGHT AT AGE } 6}$$

WE CAN SHOW THE RATE OF GROWTH AS CENTIMETERS PER YEAR, SINCE THERE IS ONE
YEAR BETWEEN THE AGES WE USED.

$$\frac{\left(-\frac{326.6}{6+1} + 173.3\right) - \left(-\frac{326.6}{6} + 173.3\right)}{1} \ \text{CM/YEAR}$$

WHAT IS AGE SIX AND A HALF IN TERMS OF THE NUMBER 6?

LET ME SEE... (6 + 0.5) YEARS OLD?

THE INCREASE IN HEIGHT IN 0.5 YEARS, BETWEEN AGE 6 AND AGE (6 + 0.5)...

$$\underbrace{\left(-\frac{326.6}{6+0.5} + 173.3\right)}_{\text{HEIGHT AT AGE } (6+0.5)} - \underbrace{\left(-\frac{326.6}{6} + 173.3\right)}_{\text{HEIGHT AT AGE } 6}$$

AND THIS IS THE INCREASE IN HEIGHT PER YEAR, BETWEEN AGE 6 AND AGE (6 + 0.5).

$$\frac{\left(-\frac{326.6}{6+0.5} + 173.3\right) - \left(-\frac{326.6}{6} + 173.3\right)}{0.5} \ \text{CM/YEAR}$$
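IN MATHEMATICS, WE USE THIS SYMBOL Δ (DELTA) TO REPRESENT CHANGE.

IT DESCRIBES THE EXTREMELY SHORT PERIOD OF TIME BETWEEN THE AGE OF 6 AND
RIGHT AFTER TURNING 6. USING OUR EQUATION, WE CAN FIND THE CHANGE IN
HEIGHT IN THAT PERIOD.

THAT MEANS "THE INCREASE IN HEIGHT PER YEAR, BETWEEN AGE 6 AND IMMEDIATELY
AFTER TURNING 6" CAN BE DESCRIBED LIKE THIS:

$$\frac{\left(-\frac{326.6}{6+\Delta} + 173.3\right) - \left(-\frac{326.6}{6} + 173.3\right)}{\Delta} \ \text{CM/YEAR}$$

$$\begin{aligned}
\frac{\left(-\frac{326.6}{6+\Delta} + 173.3\right) - \left(-\frac{326.6}{6} + 173.3\right)}{\Delta}
&= \left(-\frac{326.6}{6+\Delta} + \frac{326.6}{6}\right) \times \frac{1}{\Delta} \\
&= \left(\frac{326.6}{6} - \frac{326.6}{6+\Delta}\right) \times \frac{1}{\Delta} \\
&= 326.6 \times \frac{(6+\Delta) - 6}{6(6+\Delta)} \times \frac{1}{\Delta} \\
&= 326.6 \times \frac{\Delta}{6(6+\Delta)} \times \frac{1}{\Delta} \\
&= 326.6 \times \frac{1}{6(6+\Delta)} \\
&\approx 326.6 \times \frac{1}{6(6+0)} \\
&= 326.6 \times \frac{1}{6^2} \ \text{(CM/YEAR)}
\end{aligned}$$

ARE YOU FOLLOWING SO FAR? THERE ARE A LOT OF STEPS IN THIS CALCULATION, BUT
IT'S NOT TOO HARD, IS IT?

NO, I THINK I CAN HANDLE THIS.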
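
To see the shrinking-Δ idea in numbers, here is a short Python sketch, offered as an illustration rather than as part of the lesson. The names `height` and `growth_rate` are hypothetical; the formula is the regression equation quoted in the dialogue. As Δ gets smaller, the growth rate at age 6 should approach 326.6 × 1/6².

```python
def height(age):
    """Regression equation from the dialogue: y = -326.6 / x + 173.3 (cm)."""
    return -326.6 / age + 173.3


def growth_rate(age, delta):
    """Increase in height per year between `age` and `age + delta`."""
    return (height(age + delta) - height(age)) / delta


limit = 326.6 / 6 ** 2  # the value the calculation in the text arrives at

for delta in (1, 0.5, 0.01, 0.0001):
    print(f"delta = {delta}: {growth_rate(6, delta):.4f} cm/year")
print(f"326.6 / 6^2 = {limit:.4f} cm/year")
```

The printed rates move toward the last line as `delta` shrinks, which is the numerical counterpart of replacing Δ with 0 in the final steps.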
THE ANSWER IS $326.6 \times \frac{1}{6^2}$.

WE CALL IT DIFFERENTIATING, AS IN DIFFERENTIAL CALCULUS. NOW WE HAVE A
FUNCTION THAT DESCRIBES YOUR RATE OF GROWTH!

I DID CALCULUS!

BY THE WAY, DERIVATIVES CAN BE WRITTEN WITH THE PRIME SYMBOL (′) OR AS
$\frac{dy}{dx}$.

$$\frac{dy}{dx} = y'$$

NOW! I CHALLENGE YOU TO TRY DIFFERENTIATING OTHER FUNCTIONS. WHAT DO YOU SAY?
YOU LOOK NERVOUS. RELAX!

IN MATH, A MATRIX IS A WAY TO ORGANIZE A RECTANGULAR ARRAY OF NUMBERS. NOW
I'LL GO OVER THE RULES OF MATRIX ADDITION, MULTIPLICATION, AND INVERSION.
TAKE CAREFUL NOTES, OKAY?
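A MATRIX CAN BE USED TO WRITE EQUATIONS QUICKLY. JUST AS WITH EXPONENTS,
MATHEMATICIANS HAVE RULES FOR WRITING THEM.

$$\begin{cases} x_1 + 2x_2 = \text{[illegible]} \\ 3x_1 + 4x_2 = 5 \end{cases} \quad\text{CAN BE WRITTEN AS}\quad \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \text{[illegible]} \\ 5 \end{pmatrix}$$

$$\begin{cases} x_1 + 2x_2 \\ 3x_1 + 4x_2 \end{cases} \quad\text{CAN BE WRITTEN AS}\quad \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

$$\begin{cases} k_1 + 2k_2 + 3k_3 = -3 \\ 4k_1 + 5k_2 + 6k_3 = 8 \\ 10k_1 + 11k_2 + 12k_3 = 2 \\ 13k_1 + 14k_2 + 15k_3 = 7 \end{cases} \quad\text{can be written as}\quad \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{pmatrix} \begin{pmatrix} k_1 \\ k_2 \\ k_3 \end{pmatrix} = \begin{pmatrix} -3 \\ 8 \\ 2 \\ 7 \end{pmatrix}$$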

If you don’t know the values of the expressions, you write the expressions
and the matrix like this:

$$\begin{cases} k_1 + 2k_2 + 3k_3 \\ 4k_1 + 5k_2 + 6k_3 \\ 7k_1 + 8k_2 + 9k_3 \\ 10k_1 + 11k_2 + 12k_3 \\ 13k_1 + 14k_2 + 15k_3 \end{cases} \qquad \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{pmatrix} \begin{pmatrix} k_1 \\ k_2 \\ k_3 \end{pmatrix}$$

Just like an ordinary table, we say matrices have columns and rows. Each
number inside of the matrix is called an element.


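SUMMARY

$$\begin{cases} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1q}x_q = b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2q}x_q = b_2 \\ \qquad\vdots \\ a_{p1}x_1 + a_{p2}x_2 + \cdots + a_{pq}x_q = b_p \end{cases} \quad\text{can be written as}\quad \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pq} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_q \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix}$$

$$\begin{cases} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1q}x_q \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2q}x_q \\ \qquad\vdots \\ a_{p1}x_1 + a_{p2}x_2 + \cdots + a_{pq}x_q \end{cases} \quad\text{can be written as}\quad \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pq} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_q \end{pmatrix}$$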
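
The matrix shorthand can be mirrored in code. The following Python sketch is an illustration under stated assumptions: `mat_vec` is a hypothetical helper, the coefficient matrix is the four-row example from this section, and the values chosen for k are made up purely to show that each output entry is one left-hand-side expression.

```python
def mat_vec(matrix, vector):
    """Multiply a p x q matrix (a list of rows) by a length-q vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Coefficients of k1, k2, k3 from the four-equation example.
A = [[1, 2, 3],
     [4, 5, 6],
     [10, 11, 12],
     [13, 14, 15]]

# Hypothetical values for k1, k2, k3 (not from the text).
k = [1, 0, -1]

# Each entry is one expression, e.g. the first is 1*k1 + 2*k2 + 3*k3.
print(mat_vec(A, k))
```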
ADDING MATRICES

NEXT, I'LL EXPLAIN THE ADDITION OF MATRICES. CONSIDER THIS:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 4 & 5 \\ -2 & 4 \end{pmatrix}$$

NOW JUST ADD THE NUMBERS IN THE SAME POSITION: TOP LEFT PLUS TOP LEFT, AND
SO ON.

$$\begin{pmatrix} 1+4 & 2+5 \\ 3+(-2) & 4+4 \end{pmatrix} = \begin{pmatrix} 5 & 7 \\ 1 & 8 \end{pmatrix}$$

YOU CAN ONLY ADD MATRICES THAT HAVE THE SAME DIMENSIONS, THAT IS, THE SAME
NUMBER OF ROWS AND COLUMNS.

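EXAMPLE PROBLEM 1

What is $\begin{pmatrix} 5 & 1 \\ 6 & -9 \end{pmatrix} + \begin{pmatrix} -1 & 3 \\ -3 & 10 \end{pmatrix}$?

ANSWER

$$\begin{pmatrix} 5 & 1 \\ 6 & -9 \end{pmatrix} + \begin{pmatrix} -1 & 3 \\ -3 & 10 \end{pmatrix} = \begin{pmatrix} 5+(-1) & 1+3 \\ 6+(-3) & (-9)+10 \end{pmatrix} = \begin{pmatrix} 4 & 4 \\ 3 & 1 \end{pmatrix}$$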

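EXAMPLE PROBLEM 2

What is $\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{pmatrix} + \begin{pmatrix} 7 & 2 & 3 \\ -1 & 7 & -4 \\ -7 & -3 & 10 \\ 8 & 2 & -1 \\ 7 & 1 & -9 \end{pmatrix}$?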
ANSWER

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{pmatrix} + \begin{pmatrix} 7 & 2 & 3 \\ -1 & 7 & -4 \\ -7 & -3 & 10 \\ 8 & 2 & -1 \\ 7 & 1 & -9 \end{pmatrix} = \begin{pmatrix} 1+7 & 2+2 & 3+3 \\ 4+(-1) & 5+7 & 6+(-4) \\ 7+(-7) & 8+(-3) & 9+10 \\ 10+8 & 11+2 & 12+(-1) \\ 13+7 & 14+1 & 15+(-9) \end{pmatrix} = \begin{pmatrix} 8 & 4 & 6 \\ 3 & 12 & 2 \\ 0 & 5 & 19 \\ 18 & 13 & 11 \\ 20 & 15 & 6 \end{pmatrix}$$

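SUMMARY

Here are two generic matrices:

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pq} \end{pmatrix} \qquad \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1q} \\ b_{21} & b_{22} & \cdots & b_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ b_{p1} & b_{p2} & \cdots & b_{pq} \end{pmatrix}$$

You can add them together, like this:

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1q} \\ a_{21} & a_{22} & \cdots & a_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pq} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1q} \\ b_{21} & b_{22} & \cdots & b_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ b_{p1} & b_{p2} & \cdots & b_{pq} \end{pmatrix} = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{12} & \cdots & a_{1q}+b_{1q} \\ a_{21}+b_{21} & a_{22}+b_{22} & \cdots & a_{2q}+b_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1}+b_{p1} & a_{p2}+b_{p2} & \cdots & a_{pq}+b_{pq} \end{pmatrix}$$

And of course, matrix subtraction works the same way. Just subtract
the corresponding elements!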
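
Element-by-element addition translates directly into code. This is a minimal Python sketch, assuming matrices are stored as lists of rows; the function name `mat_add` is made up for illustration. It also enforces the rule that both matrices must have the same dimensions.

```python
def mat_add(left, right):
    """Add two matrices (lists of rows) element by element."""
    same_rows = len(left) == len(right)
    same_cols = all(len(r1) == len(r2) for r1, r2 in zip(left, right))
    if not (same_rows and same_cols):
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]


# The worked example and Example Problem 1 from this section.
print(mat_add([[1, 2], [3, 4]], [[4, 5], [-2, 4]]))
print(mat_add([[5, 1], [6, -9]], [[-1, 3], [-3, 10]]))
```

Subtraction would follow the same pattern with `a - b` in place of `a + b`.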


MULTIPLYING MATRICES

ON TO MATRIX MULTIPLICATION! WE DON'T MULTIPLY MATRICES IN THE SAME WAY AS
WE ADD AND SUBTRACT THEM. IT'S EASIEST TO EXPLAIN BY EXAMPLE, SO LET'S
MULTIPLY THE FOLLOWING:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \end{pmatrix}$$
WE MULTIPLY EACH ELEMENT IN THE FIRST COLUMN OF THE LEFT MATRIX BY THE TOP
ELEMENT OF THE FIRST COLUMN IN THE RIGHT MATRIX, THEN THE SECOND COLUMN OF
THE LEFT MATRIX BY THE SECOND ELEMENT IN THE FIRST COLUMN OF THE RIGHT
MATRIX. THEN WE ADD THE PRODUCTS, LIKE THIS:

$$\begin{pmatrix} 1x_1 + 2x_2 \\ 3x_1 + 4x_2 \end{pmatrix}$$

AND THEN WE DO THE SAME WITH THE SECOND COLUMN OF THE RIGHT MATRIX TO GET:

$$\begin{pmatrix} 1y_1 + 2y_2 \\ 3y_1 + 4y_2 \end{pmatrix}$$

SO THE FINAL RESULT IS:

$$\begin{pmatrix} 1x_1 + 2x_2 & 1y_1 + 2y_2 \\ 3x_1 + 4x_2 & 3y_1 + 4y_2 \end{pmatrix}$$

IN MATRIX MULTIPLICATION, FIRST YOU MULTIPLY AND THEN YOU ADD TO GET THE
FINAL RESULT. LET'S TRY THIS OUT.


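EXAMPLE PROBLEM 1

What is $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 4 & 5 \\ -2 & 4 \end{pmatrix}$?

We know to multiply the elements and then add the terms to simplify.
When multiplying, we take the right matrix, column by column, and
multiply it by the left matrix.*

ANSWER

First column:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 4 \\ -2 \end{pmatrix} = \begin{pmatrix} 1 \times 4 + 2 \times (-2) \\ 3 \times 4 + 4 \times (-2) \end{pmatrix} = \begin{pmatrix} 0 \\ 4 \end{pmatrix}$$

Second column:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 4 \\ 3 \times 5 + 4 \times 4 \end{pmatrix} = \begin{pmatrix} 13 \\ 31 \end{pmatrix}$$

So the answer is

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 4 & 5 \\ -2 & 4 \end{pmatrix} = \begin{pmatrix} 0 & 13 \\ 4 & 31 \end{pmatrix}$$

* NOTE THAT THE RESULTING MATRIX WILL HAVE THE SAME NUMBER OF ROWS AS
THE FIRST MATRIX AND THE SAME NUMBER OF COLUMNS AS THE SECOND MATRIX.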
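EXAMPLE PROBLEM 2

What is $\begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \\ 10 & 11 \end{pmatrix} \begin{pmatrix} k_1 & l_1 & m_1 \\ k_2 & l_2 & m_2 \end{pmatrix}$?

ANSWER

Multiply the first column of the second matrix by the respective
rows of the first matrix.

$$\begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \\ 10 & 11 \end{pmatrix} \begin{pmatrix} k_1 \\ k_2 \end{pmatrix} = \begin{pmatrix} k_1 + 2k_2 \\ 4k_1 + 5k_2 \\ 7k_1 + 8k_2 \\ 10k_1 + 11k_2 \end{pmatrix}$$

Do the same with the second column.

$$\begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \\ 10 & 11 \end{pmatrix} \begin{pmatrix} l_1 \\ l_2 \end{pmatrix} = \begin{pmatrix} l_1 + 2l_2 \\ 4l_1 + 5l_2 \\ 7l_1 + 8l_2 \\ 10l_1 + 11l_2 \end{pmatrix}$$

And the third column.

$$\begin{pmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \\ 10 & 11 \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \end{pmatrix} = \begin{pmatrix} m_1 + 2m_2 \\ 4m_1 + 5m_2 \\ 7m_1 + 8m_2 \\ 10m_1 + 11m_2 \end{pmatrix}$$

The final answer is just a concatenation of the three answers above.

$$\begin{pmatrix} k_1 + 2k_2 & l_1 + 2l_2 & m_1 + 2m_2 \\ 4k_1 + 5k_2 & 4l_1 + 5l_2 & 4m_1 + 5m_2 \\ 7k_1 + 8k_2 & 7l_1 + 8l_2 & 7m_1 + 8m_2 \\ 10k_1 + 11k_2 & 10l_1 + 11l_2 & 10m_1 + 11m_2 \end{pmatrix}$$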
THE RULES OF MATRIX MULTIPLICATION

WHEN MULTIPLYING MATRICES, THERE ARE THREE THINGS TO REMEMBER:

• THE NUMBER OF COLUMNS IN THE FIRST MATRIX MUST EQUAL THE NUMBER OF ROWS
  IN THE SECOND MATRIX.
• THE RESULT MATRIX WILL HAVE A NUMBER OF ROWS EQUAL TO THE FIRST MATRIX.
• THE RESULT MATRIX WILL HAVE A NUMBER OF COLUMNS EQUAL TO THE SECOND MATRIX.

Can the following pairs of matrices be multiplied? If so, how many rows
and columns will the resulting matrix have?



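EXAMPLE PROBLEM 1

$$\begin{pmatrix} 2 & 3 & 4 \\ -5 & 3 & 6 \end{pmatrix} \begin{pmatrix} 2 \\ -7 \\ 0 \end{pmatrix}$$

ANSWER

Yes! The resulting matrix will have 2 rows and 1 column:

$$\begin{pmatrix} 2 & 3 & 4 \\ -5 & 3 & 6 \end{pmatrix} \begin{pmatrix} 2 \\ -7 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \times 2 + 3 \times (-7) + 4 \times 0 \\ (-5) \times 2 + 3 \times (-7) + 6 \times 0 \end{pmatrix} = \begin{pmatrix} -17 \\ -31 \end{pmatrix}$$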


EXAMPLE PROBLEM 2

$$\begin{pmatrix} 9 & 4 & \text{[illegible]} \\ 7 & -6 & 0 \\ -5 & 3 & 8 \end{pmatrix} \begin{pmatrix} 2 & -2 & 1 \\ 4 & 9 & -7 \end{pmatrix}$$

ANSWER

No. The number of columns in the first matrix is 3, but the number of
rows in the second matrix is 2. These matrices cannot be multiplied.
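
The three rules can be folded into one small function. The Python sketch below is illustrative only: `mat_mul` is a hypothetical name, matrices are assumed to be lists of rows, and the 3 × 3 matrix in the last call is a stand-in of zeros used solely to trigger the dimension check.

```python
def mat_mul(left, right):
    """Multiply two matrices stored as lists of rows."""
    rows, inner, cols = len(left), len(right), len(right[0])
    # Rule 1: columns of the first matrix must equal rows of the second.
    if any(len(row) != inner for row in left):
        raise ValueError("columns of the first matrix must equal rows of the second")
    # Rules 2 and 3: the result has `rows` rows and `cols` columns.
    return [[sum(left[i][k] * right[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


print(mat_mul([[1, 2], [3, 4]], [[4, 5], [-2, 4]]))      # 2 x 2 times 2 x 2
print(mat_mul([[2, 3, 4], [-5, 3, 6]], [[2], [-7], [0]]))  # 2 x 3 times 3 x 1

# A 3 x 3 matrix cannot be multiplied by a 2 x 3 matrix.
try:
    mat_mul([[0, 0, 0]] * 3, [[2, -2, 1], [4, 9, -7]])
except ValueError as error:
    print("cannot multiply:", error)
```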
IDENTITY AND INVERSE MATRICES

THE LAST THINGS I'M GOING TO EXPLAIN TONIGHT ARE IDENTITY MATRICES AND
INVERSE MATRICES.

AN IDENTITY MATRIX IS A SQUARE MATRIX WITH ONES ACROSS THE DIAGONAL, FROM
TOP LEFT TO BOTTOM RIGHT, AND ZEROS EVERYWHERE ELSE.

HERE IS A 2 × 2 IDENTITY MATRIX: $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

AND HERE IS A 3 × 3 IDENTITY MATRIX: $\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Some square matrices (matrices that have the same number of rows as
columns) are invertible. A square matrix multiplied by its inverse will equal
an identity matrix of the same size and shape, so it’s easy to demonstrate
that one matrix is the inverse of another.

For example:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} -2 & 1 \\ 1.5 & -0.5 \end{pmatrix} = \begin{pmatrix} 1 \times (-2) + 2 \times 1.5 & 1 \times 1 + 2 \times (-0.5) \\ 3 \times (-2) + 4 \times 1.5 & 3 \times 1 + 4 \times (-0.5) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

So $\begin{pmatrix} -2 & 1 \\ 1.5 & -0.5 \end{pmatrix}$ is the inverse of $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.
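
The text shows how to verify an inverse but not how to find one. As a hedged aside, the Python sketch below uses the standard closed-form inverse of a 2 × 2 matrix (swap the diagonal, negate the off-diagonal, divide by ad − bc); that formula is not given in this section, and the function names are made up for illustration.

```python
def inverse_2x2(matrix):
    """Invert a 2 x 2 matrix with the ad - bc formula (an addition to the text)."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def mat_mul_2x2(left, right):
    """Multiply two 2 x 2 matrices."""
    return [[sum(left[i][k] * right[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


m = [[1, 2], [3, 4]]
m_inv = inverse_2x2(m)
print(m_inv)                  # should match the inverse shown in the example
print(mat_mul_2x2(m, m_inv))  # should be the 2 x 2 identity matrix
```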



WE'RE FINISHED FOR TODAY.

THANK YOU SO MUCH FOR TEACHING ME.

STATISTICAL DATA TYPES


Now that you’ve had a little general math refresher, it’s time for a
refreshing chaser of statistics, a branch of mathematics that deals
with the interpretation and analysis of data. Let’s dive right in.

We can categorize data into two types. Data that can be measured
with numbers is called numerical data, and data that cannot
be measured is called categorical data. Numerical data is sometimes
called quantitative data, and categorical data is sometimes
called qualitative data. These names are subjective and vary based
on the field and the analyst. Table 1-1 shows examples of numerical
and categorical data.

TABLE 1-1: NUMERICAL VS. CATEGORICAL DATA

            Number of books   Age          Place where person   Gender
            read per month    (in years)   most often reads
Person A    4                 20           Train                Female
Person B    2                 19           Home                 Male
Person C    10                18           Cafe                 Male
Person D    14                22           Library              Female
            \____ Numerical Data ____/     \____ Categorical Data ____/

Number of books read per month and Age are both examples
of numerical data, while Place where person most often reads and
Gender are not typically represented by numbers. However, categorical
data can be converted into numerical data, and vice versa.
Table 1-2 gives an example of how numerical data can be converted
to categorical.


TABLE 1-2: CONVERTING NUMERICAL DATA TO CATEGORICAL DATA

            Number of books   Number of books
            read per month    read per month
            (numerical)       (categorical)
Person A    4                 Few
Person B    2                 Few
Person C    10                Many
Person D    14                Many
In this conversion, the analyst has converted the values 1 to 5
into the category Few, values 6 to 9 into the category Average, and
values 10 and higher into the category Many. The ranges are up to
the discretion of the researcher. Note that these three categories
(Few, Average, Many) are ordinal, meaning that they can be ranked
in order: Many is more than Average is more than Few. Some categories
cannot be easily ordered. For instance, how would one easily
order the categories Brown, Purple, Green?
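
The conversion described above can be sketched in a few lines of Python. The function name `books_category` is hypothetical, and treating a count of 0 as Few is an assumption, since the text only assigns categories to values of 1 and above.

```python
def books_category(count):
    """Map books read per month to the ordinal categories used in Table 1-2."""
    if count >= 10:
        return "Many"
    if count >= 6:
        return "Average"
    return "Few"  # covers 1 to 5; 0 is also treated as Few (an assumption)


books_per_month = {"Person A": 4, "Person B": 2, "Person C": 10, "Person D": 14}
for person, count in books_per_month.items():
    print(person, count, books_category(count))
```

Because the cutoffs are the researcher's choice, changing the two thresholds changes the categories without touching the underlying counts.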

Table 1-3 provides an example of how categorical data can be 
converted to numerical data. 


TABLE 1-3: CONVERTING CATEGORICAL DATA TO NUMERICAL DATA

            Favorite
            season     Spring   Summer   Autumn   Winter
Person A    Spring     1        0        0        0
Person B    Summer     0        1        0        0
Person C    Autumn     0        0        1        0
Person D    Winter     0        0        0        1


In this case, we have converted the categorical data Favorite
season, which has four categories (Spring, Summer, Autumn,
Winter), into binary data in four columns. The data is described
as binary because it takes on one of two values: Favorite is represented
by 1 and Not Favorite is represented by 0.

It is also possible to represent this data with three columns.
Why can we omit one column? Because we know each respondent’s
favorite season even if a column is omitted. For example, if the first
three columns (Spring, Summer, Autumn) are 0, you know Winter
must be 1, even if it isn’t shown.

In multiple regression analysis, we need to ensure that our data
is linearly independent; that is, no set of columns shown can be
used to exactly infer the content of another column within that set.
Ensuring linear independence is often done by deleting the last column
of data. Because the following statement is true, we can delete
the Winter column from Table 1-3:

(Winter) = 1 - (Spring) - (Summer) - (Autumn)
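
The relationship above is easy to confirm in code. This Python sketch is an illustration only; `one_hot` and `SEASONS` are made-up names, and the `drop_last` option mirrors the text's suggestion of deleting the last column.

```python
SEASONS = ["Spring", "Summer", "Autumn", "Winter"]


def one_hot(season, drop_last=False):
    """Encode a favorite season as 0/1 columns, optionally omitting Winter."""
    columns = SEASONS[:-1] if drop_last else SEASONS
    return [1 if season == name else 0 for name in columns]


favorites = {"Person A": "Spring", "Person B": "Summer",
             "Person C": "Autumn", "Person D": "Winter"}

for person, season in favorites.items():
    reduced = one_hot(season, drop_last=True)
    # Winter can be recovered from the three remaining columns.
    winter = 1 - sum(reduced)
    print(person, one_hot(season), reduced, "recovered Winter =", winter)
```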

In regression analysis, we must be careful to recognize which
variables are numerical, ordinal, and categorical so we use the
variables correctly.
HYPOTHESIS TESTING

Statistical methods are often used to test scientific hypotheses. A
hypothesis is a proposed statement about the relationship between
variables or the properties of a single variable, describing a
phenomenon or concept. We collect data and use hypothesis testing to
decide whether our hypothesis is supported by the data.

We set up a hypothesis test by stating not one but two hypotheses,
called the null hypothesis ($H_0$) and the alternative hypothesis
($H_1$). The null hypothesis is the default hypothesis we wish to
disprove, usually stating that there is a specific relationship (or none
at all) between variables or the properties of a single variable. The
alternative hypothesis is the hypothesis we are trying to prove.

If our data differs enough from what we would expect if the null
hypothesis were true, we can reject the null and accept the alternative
hypothesis. Let’s consider a very simple example, with the
following hypotheses:

$H_0$: Children order on average 10 cups of hot chocolate per month.

$H_1$: Children do not order on average 10 cups of hot chocolate per month.

We’re proposing statements about a single variable, the number
of hot chocolates ordered per month, and checking if it has a
certain property: having an average of 10. Suppose we observed five
children for a month and found that they ordered 7, 9, 10, 11, and
13 cups of hot chocolate, respectively. We assume these five children
are a representative sample of the total population of all hot
chocolate-drinking children. The average of these five children’s
orders is 10. In this case, we cannot prove that the null hypothesis
is false, since the value proposed in our null hypothesis (10) is
indeed the average of this sample.

However, suppose we observed a sample of five different children
for a month and they ordered 29, 30, 31, 32, and 35 cups of hot
chocolate, respectively. The average of these five children’s orders
is 31.4; in fact, not a single child came anywhere close to drinking
only 10 cups of hot chocolate. On the basis of this data, we would
assert that we should reject the null hypothesis.
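
The two sample averages can be reproduced with Python's standard `statistics` module. The one-sample t statistic computed below is an addition that the text does not use; it is included only as one common, assumed way to quantify "differs enough" from the hypothesized mean of 10, and the function name is made up for illustration.

```python
import math
import statistics


def t_statistic(sample, hypothesized_mean):
    """One-sample t statistic (an assumed measure; not introduced in the text)."""
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (mean - hypothesized_mean) / std_error


first_sample = [7, 9, 10, 11, 13]
second_sample = [29, 30, 31, 32, 35]

for sample in (first_sample, second_sample):
    print("mean:", statistics.mean(sample),
          "t statistic vs. 10:", round(t_statistic(sample, 10), 2))
```

A statistic near zero, as for the first sample, gives no grounds to reject the null hypothesis, while a very large one, as for the second, points toward rejecting it.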

In this example, we’ve stated hypotheses about a single variable:
the number of cups each child orders per month. But when
we’re looking at the relationship between two or more variables,
as we do in regression analysis, our null hypothesis usually states
that there is no relationship between the variables being tested,
and the alternative hypothesis states that there is a relationship.
MEASURING VARIATION

Suppose Miu and Risa had a karaoke competition with some friends
from school. They competed in two teams of five. Table 1-4 shows
how they scored.

TABLE 1-4: KARAOKE SCORES FOR TEAM MIU AND TEAM RISA

Team Miu                    Team Risa
Team member    Score        Team member    Score
Miu            48           Risa           67
Yuko           32           Asuka          55
Aiko           88           Nana           61
Maya           61           Yuki           63
Marie          71           Rika           54
Average        60           Average        60


There are multiple statistics we can use to describe the “center”
of a data set. Table 1-4 shows the average of the data for each team,
also known as the mean. This is calculated by adding the scores of
each member of the group and dividing by the number of members
in the group. Each of the karaoke groups has a mean score of 60.

We could also define the center of these data sets as being the
middle number of each group when the scores are put in order. This
is the median of the data. To find the median, write the scores in
increasing order (for Team Miu, this is 32, 48, 61, 71, 88) and the
median is the number in the middle of this list. For Team Miu,
the median is Maya’s score of 61. The median happens to be 61
for Team Risa as well, with Nana having the median score on this
team. If there were an even number of members on each team, we
would usually take the mean of the two middle scores.
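
Both measures of center are available in Python's standard `statistics` module. The sketch below is for illustration; the dictionary names are made up, and the scores are those in Table 1-4.

```python
import statistics

team_miu = {"Miu": 48, "Yuko": 32, "Aiko": 88, "Maya": 61, "Marie": 71}
team_risa = {"Risa": 67, "Asuka": 55, "Nana": 61, "Yuki": 63, "Rika": 54}

for name, team in (("Team Miu", team_miu), ("Team Risa", team_risa)):
    scores = list(team.values())
    print(name,
          "sorted:", sorted(scores),
          "mean:", statistics.mean(scores),
          "median:", statistics.median(scores))
```

For a list with an even number of scores, `statistics.median` returns the mean of the two middle values, matching the convention described above.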

So far, the statistics we’ve calculated seem to indicate that the 
two sets of scores are the same. But what do you notice when we 
put the scores on a number line (see Figure 1-1)? 



[Figure: two number lines from 0 to 100, labeled Score, one for Team Miu and
one for Team Risa, with each member's score marked]

FIGURE 1-1: KARAOKE SCORES FOR TEAM MIU AND TEAM RISA ON NUMBER LINES
Team Miu’s scores are much more spread out than Team
Risa’s. Thus, we say that the data sets have different variation.

There are several ways to measure variation, including the sum
of squared deviations, variance, and standard deviation. Each of
these measures shares the following characteristics:

• All of them measure the spread of the data from the mean.
• The greater the variation in the data, the greater the value of the
  measure.
• The minimum value of the measures is zero, which happens only
  if your data doesn’t vary at all!


SUM OF SQUARED DEVIATIONS

The sum of squared deviations is a measure often used during
regression analysis. It is calculated as follows:

sum of (individual score - mean score)²,

which is written mathematically as

$$\sum (x - \bar{x})^2.$$

The sum of squared deviations is not often used on its own to
describe variation because it has a fatal shortcoming: its value
increases as the number of data points increases. As you have
more and more numbers, the sum of their squared differences from the
mean gets bigger and bigger.

VARIANCE

This shortcoming is alleviated by calculating the variance:

$$\frac{\sum (x - \bar{x})^2}{n - 1}, \quad \text{where } n = \text{the number of data points.}$$

This calculation is also called the unbiased sample variance,
because the denominator is the number of data points minus 1
rather than simply the number of data points. In research studies
that use data from samples, we usually subtract 1 from the number
of data points to adjust for the fact that we are using a sample of the
population, rather than the entire population. This increases the
variance.

This reduced denominator is called the degrees of freedom,
because it represents the number of values that are free to vary.
For practical purposes, it is the number of cases (for example,
observations or groups) minus 1. So if we were looking at Team Miu and
Team Risa as samples of the entire karaoke-singing population,
we’d say there were 4 degrees of freedom when calculating their
statistics, since there are five members on each team. We subtract
1 from the number of singers because they are just a sample of all
possible singers in the world, and dividing by the smaller number keeps
us from underestimating the variance among them.

The units of the variance are not the same as the units of the 
observed data. Instead, variance is expressed in units squared, in 
this case “points squared.” 

STANDARD DEVIATION

Like variance, the standard deviation shows whether all the data
points are clustered together or spread out. The standard deviation
is actually just the square root of the variance:

$$\sqrt{\text{variance}}$$

Researchers usually use standard deviation as the measure of
variation because the units of the standard deviation are the same
as those of the original data. For our karaoke singers, the standard
deviation is reported in “points.”

Let’s calculate the sum of squared deviations, variance, and
standard deviation for Team Miu (see Table 1-5).


TABLE 1-5: MEASURING VARIATION OF SCORES FOR TEAM MIU

Measure of variation         Calculation
Sum of squared deviations    (48 - 60)² + (32 - 60)² + (88 - 60)² + (61 - 60)² + (71 - 60)²
                             = (-12)² + (-28)² + 28² + 1² + 11²
                             = 1834
Variance                     1834 / (5 - 1) = 458.5
Standard deviation           √458.5 = 21.4


Now let’s do the same for Team Risa (see Table 1-6).

TABLE 1-6: MEASURING VARIATION OF SCORES FOR TEAM RISA

Measure of variation         Calculation
Sum of squared deviations    (67 - 60)² + (55 - 60)² + (61 - 60)² + (63 - 60)² + (54 - 60)²
                             = 7² + (-5)² + 1² + 3² + (-6)²
                             = 120
Variance                     120 / (5 - 1) = 30
Standard deviation           √30 = 5.5
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We see that Team Risa’s standard deviation is 5.5 points, 
whereas Team Miu’s is 21.4 points. Team Risa’s karaoke scores 
vary less than Team Miu’s, so Team Risa has more consistent 
karaoke performers. 
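
The two tables can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the function name `variation` is an illustrative choice, not something from the book.

```python
import math

def variation(scores):
    """Return (sum of squared deviations, sample variance, standard deviation)."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((s - mean) ** 2 for s in scores)
    variance = ss / (n - 1)  # divide by the degrees of freedom, n - 1
    return ss, variance, math.sqrt(variance)

team_miu = [48, 32, 88, 61, 71]
team_risa = [67, 55, 61, 63, 54]

miu_ss, miu_var, miu_sd = variation(team_miu)
risa_ss, risa_var, risa_sd = variation(team_risa)

print(f"Team Miu:  SS={miu_ss:.0f}, variance={miu_var:.1f}, SD={miu_sd:.1f}")
print(f"Team Risa: SS={risa_ss:.0f}, variance={risa_var:.1f}, SD={risa_sd:.1f}")
```

Both teams have a mean of 60 points, so the difference between them shows up only in the measures of variation.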


PROBABILITY DENSITY FUNCTIONS

We use probability to model events that we cannot predict with certainty. Although we can accurately predict many future events (such as whether running out of gas will cause a car to stop running or how much rocket fuel it would take to get to Mars), many physical, chemical, biological, social, and strategic problems are so complex that we cannot hope to know all of the variables and forces that affect the outcome.

A simple example is the flipping of a coin. We do not know all of the physical forces involved in a single coin flip: temperature, torque, spin, landing surface, and so on. However, we expect that over the course of many flips, the variation in all these factors will cancel out, and we will observe roughly equal numbers of heads and tails. Table 1-5 shows the results of flipping a billion quarters, given both as the number of flips and as the percentage of flips.


TABLE 1-5: TALLY OF A BILLION COIN FLIPS

                      Number of flips    Percentage of flips
Heads                 499,993,945        49.99939%
Tails                 500,006,054        50.00061%
Stands on its edge    1                  0.0000001%


As we might have guessed, the percentages of heads and tails 
are both very close to 50%. We can summarize what we know about 
coin flips in a probability density function, P(x), which we can apply 
to any given coin flip, as shown here: 

P(Heads) = .5, P(Tails) = .5, P(Stands on its edge) < 1 × 10⁻⁹

But what if we are playing with a cheater? Perhaps someone has 
weighted the coin so that P(x) is now this: 

P(Heads) = .3, P(Tails) = .7, P(Stands on its edge) = 0 

What do we expect to happen on a single flip? Will it always be 
tails? What will the average be after a billion flips? 
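
One way to explore these questions is to simulate the weighted coin. The sketch below is only an illustration: `simulate_flips` is a made-up helper, the seed is fixed so the run is repeatable, and 100,000 flips stand in for the billion mentioned above.

```python
import random

def simulate_flips(p_heads, n_flips, seed=0):
    """Flip a coin n_flips times and return the fraction that land heads."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    heads = sum(1 for _ in range(n_flips) if rng.random() < p_heads)
    return heads / n_flips

single_flip = simulate_flips(0.3, 1)            # one flip: either 0.0 or 1.0
fraction_heads = simulate_flips(0.3, 100_000)   # many flips: settles near 0.3

print(single_flip, fraction_heads)
```

A single flip can still come up heads, so it will not always be tails, but over many flips the fraction of heads drifts toward P(Heads) = .3.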
Not all events have so few possibilities as these coin examples. We often wish to model data that can be continuously measured. For example, height is a continuous measurement. We could measure your height down to the nearest meter, centimeter, millimeter, or . . . nanometer. As we begin dealing with data where the possibilities lie on a continuous space, we need to use continuous functions to represent the probability of events.

A probability density function allows us to compute the probability that the data lies within a given range of values. We can plot a probability density function as a curve, where the x-axis represents the event space, or the possible values the result can take, and the y-axis is f(x), or the probability density function value of x. The area under the curve between two possible values represents the probability of getting a result between those two values.

NORMAL DISTRIBUTIONS

One important probability density function is the normal distribution (see Figure 1-2), also called the bell curve because of its symmetrical shape, which researchers use to model many events.

FIGURE 1-2: A NORMAL DISTRIBUTION

The standard normal distribution probability density function 
can be expressed as follows: 


$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$

The mean of the standard normal distribution function is zero.
When we plot the function, its peak or maximum is at the mean 
and thus at zero. The tails of the distribution fall symmetrically 
on either side of the mean in a bell shape and extend to infinity, 
approaching, but never quite touching, the x-axis. The standard 
normal distribution has a standard deviation of 1. Because the 
mean is zero and the standard deviation is 1, this distribution is 
also written as N(0,1). 

The area under the curve is equal to 1 (100%), since the value 
will definitely fall somewhere beneath the curve. The further from 
the mean a value is, the less probable that value is, as represented 
by the diminishing height of the curve. You may have seen a curve 
like this describing the distribution of test scores. Most test takers 
have a score that is close to the mean. A few people score exceptionally high, and a few people score very low.
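
The properties described above (a peak at zero and a total area of 1) can be checked numerically. The following sketch evaluates the standard normal density and approximates areas with the trapezoid rule; the helper names and the integration range of -8 to 8 are assumptions made for the illustration.

```python
import math

def standard_normal_pdf(x):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def area(f, lo, hi, steps=10_000):
    """Approximate the area under f between lo and hi with the trapezoid rule."""
    width = (hi - lo) / steps
    total = (f(lo) + f(hi)) / 2 + sum(f(lo + i * width) for i in range(1, steps))
    return total * width

peak = standard_normal_pdf(0)                     # height of the curve at the mean
total_area = area(standard_normal_pdf, -8, 8)     # practically the whole curve
within_one_sd = area(standard_normal_pdf, -1, 1)  # area within one standard deviation

print(round(peak, 4), round(total_area, 4), round(within_one_sd, 4))
```

The area between -1 and 1 comes out to roughly 68% of the total, which is why most values fall close to the mean.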

CHI-SQUARED DISTRIBUTIONS

Not all data is best modeled by a normal distribution. The chi-squared ($\chi^2$) distribution is a probability density function that fits the distribution of the sum of squares. That means chi-squared distributions can be used to estimate variation. The chi-squared probability density function is shown here:


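$$f(x) = \begin{cases} \dfrac{1}{2^{\frac{k}{2}} \int_0^\infty x^{\frac{k}{2} - 1} e^{-x}\,dx} \times x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}, & x > 0 \\ 0, & x \le 0 \end{cases}$$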


The sum of squares can never be negative, and we see that f(x) is exactly zero for negative numbers. When the probability density function of x is the one shown above, we say, “x follows a chi-squared distribution with k degree(s) of freedom.”

The chi-squared distribution is related to the standard normal distribution. In fact, if you take $Z_1, Z_2, \ldots, Z_k$ as a set of independent, identically distributed standard normal random variables and then take the sum of squares of these variables like this,

$$X = Z_1^2 + Z_2^2 + \cdots + Z_k^2,$$

then X is a chi-squared random variable with k degrees of freedom. Thus, we will use the chi-squared distribution of k to represent sums of squares of a set of k normal random variables.
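
The relationship between standard normal variables and the chi-squared distribution can be illustrated by simulation. This is a rough sketch with assumed settings (k = 5, 50,000 draws, a fixed seed); it relies on the standard fact that a chi-squared variable with k degrees of freedom has a mean of k.

```python
import random

def chi_squared_draw(k, rng):
    """One draw of X = Z_1^2 + ... + Z_k^2 built from k standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(k))

rng = random.Random(1)  # fixed seed so the run is repeatable
k = 5
draws = [chi_squared_draw(k, rng) for _ in range(50_000)]
sample_mean = sum(draws) / len(draws)

print(min(draws) >= 0, round(sample_mean, 2))
```

Every draw is nonnegative, as a sum of squares must be, and the sample mean lands close to k.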
In Figure 1-3, we plot two chi-squared density curves, one for k = 2 degrees of freedom and another for k = 10 degrees of freedom.

[Figure panels: "When k = 2" and "When k = 10"]

FIGURE 1-3: CHI-SQUARED DENSITY CURVES FOR 2 DEGREES OF FREEDOM (LEFT) AND 10 DEGREES OF FREEDOM (RIGHT)


Notice the differences. What is the limit of the density functions 
as x goes to infinity? Where is the peak of the functions? 

PROBABILITY DENSITY DISTRIBUTION TABLES

Let’s say we have a data set with a variable X that follows a chi-squared distribution with 5 degrees of freedom. If we wanted to know, for some point x, whether the probability P of X > x is less than a target probability, we must integrate a density curve to calculate that probability. (The point x at which P equals the target probability is known as the critical value of the statistic.) By integrate, we mean find the area under the relevant portion of the curve, illustrated in Figure 1-4.



FIGURE 1-4: THE PROBABILITY P THAT A VALUE X EXCEEDS
THE CRITICAL CHI-SQUARED VALUE x

Since that is cumbersome to do by hand, we use a computer or, if one is unavailable, a distribution table we find in a book. Distribution tables summarize features of a density curve in many ways.
In the case of the chi-squared distribution, the distribution table 
gives us the point x such that the probability that X > x is equal to 
a probability P. Statisticians often choose P = .05, meaning there is 
only a 5% chance that a randomly selected value of X will be greater 
than x . The value of P is known as a p-value. 

We use a chi-squared probability distribution table (Table 1-6) to see where our degrees of freedom and our p-value intersect. This number gives us the value of $\chi^2$ (our test statistic). The probability of a chi-squared of this magnitude is equal to or less than the p at the top of the column.


TABLE 1-6: CHI-SQUARED PROBABILITY DISTRIBUTION TABLE

Degrees of    P
freedom       .995       .99      .975     .95      .05       .025      .01       .005
 1            0.000039   0.0002   0.0010   0.0039    3.8415    5.0239    6.6349    7.8794
 2            0.0100     0.0201   0.0506   0.1026    5.9915    7.3778    9.2104   10.5965
 3            0.0717     0.1148   0.2158   0.3518    7.8147    9.3484   11.3449   12.8381
 4            0.2070     0.2971   0.4844   0.7107    9.4877   11.1433   13.2767   14.8602
 5            0.4118     0.5543   0.8312   1.1455   11.0705   12.8325   15.0863   16.7496
 6            0.6757     0.8721   1.2373   1.6354   12.5916   14.4494   16.8119   18.5475
 7            0.9893     1.2390   1.6899   2.1673   14.0671   16.0128   18.4753   20.2777
 8            1.3444     1.6465   2.1797   2.7326   15.5073   17.5345   20.0902   21.9549
 9            1.7349     2.0879   2.7004   3.3251   16.9190   19.0228   21.6660   23.5893
10            2.1558     2.5582   3.2470   3.9403   18.3070   20.4832   23.2093   25.1881



To read this table, identify the k degrees of freedom in the first column to determine which row to use. Then select a value for p. For instance, if we selected p = .05 and had degrees of freedom k = 5, then we would find where the fifth p column and the fifth row intersect in Table 1-6. We see that x = 11.0705. This means that for a chi-squared random variable with 5 degrees of freedom, the probability of getting a draw X = 11.0705 or greater is .05. In other words, the area under the curve corresponding to chi-squared values of 11.0705 or greater is equal to 5% of the total area under the curve.

If we observed a chi-squared random variable with 5 degrees 
of freedom to have a value of 6.1, is the probability more or less 
than .05? 
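
When a computer is available, the area in the upper tail can be approximated directly instead of read from a table. The sketch below is an assumption-laden illustration: it writes the integral in the denominator of the density as the gamma function (`math.gamma`), and it uses a simple midpoint rule with an arbitrary number of steps. The function names are made up for this example.

```python
import math

def chi_squared_pdf(x, k):
    """Chi-squared density with k degrees of freedom."""
    if x <= 0:
        return 0.0
    # math.gamma(k / 2) equals the integral of x^(k/2 - 1) * e^(-x) from 0 to infinity
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def upper_tail(x, k, steps=100_000):
    """P(X > x): 1 minus the area under the density between 0 and x (midpoint rule)."""
    width = x / steps
    area = sum(chi_squared_pdf((i + 0.5) * width, k) for i in range(steps)) * width
    return 1 - area

p_critical = upper_tail(11.0705, 5)  # the table entry for k = 5, p = .05
p_observed = upper_tail(6.1, 5)      # the observed value from the question above

print(round(p_critical, 4), p_observed > 0.05)
```

Because 6.1 lies to the left of the critical value 11.0705, more of the curve sits to its right, so the probability of a value that large or larger is more than .05.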
F DISTRIBUTIONS

The F distribution is the distribution of a ratio of two independent chi-squared random variables, each divided by its degrees of freedom, and it is used to compare the variances of two samples. As a result, it has two different degrees of freedom, one for each sample.

This is the probability density function of an F distribution: 


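$$f(x) = \begin{cases} \dfrac{\left(\int_0^\infty x^{\frac{v_1 + v_2}{2} - 1} e^{-x}\,dx\right) \times (v_1)^{\frac{v_1}{2}} \times (v_2)^{\frac{v_2}{2}}}{\left(\int_0^\infty x^{\frac{v_1}{2} - 1} e^{-x}\,dx\right) \times \left(\int_0^\infty x^{\frac{v_2}{2} - 1} e^{-x}\,dx\right)} \times \dfrac{x^{\frac{v_1}{2} - 1}}{(v_1 \times x + v_2)^{\frac{v_1 + v_2}{2}}}, & x > 0 \\ 0, & x \le 0 \end{cases}$$
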


If the probability density function of X is the one shown above, in statistics, we say, “X follows an F distribution with degrees of freedom $v_1$ and $v_2$.”

When $v_1 = 5$ and $v_2 = 10$ and when $v_1 = 10$ and $v_2 = 5$, we get slightly different curves, as shown in Figure 1-5.




FIGURE 1-5: F DISTRIBUTION DENSITY CURVES FOR 5 AND 10 RESPECTIVE DEGREES OF FREEDOM (LEFT) AND 10 AND 5 RESPECTIVE DEGREES OF FREEDOM (RIGHT)

Figure 1-6 shows a graph of an F distribution with degrees of freedom $v_1$ and $v_2$. This shows the F value as a point on the horizontal axis, and the total area of the shaded part to the right is the probability P that a variable with an F distribution has a value greater than the selected F value.

[Figure: F density curve with the point F(first degree of freedom, second degree of freedom; P) marked on the horizontal axis]

FIGURE 1-6: THE PROBABILITY P THAT A VALUE x EXCEEDS THE CRITICAL F VALUE


Table 1-7 shows the F distribution table when p = .05.

TABLE 1-7: F PROBABILITY DISTRIBUTION TABLE FOR p = .05

v2 \ v1       1       2       3       4       5       6       7       8       9      10
 1        161.4   199.5   215.7   224.6   230.2   234.0   236.8   238.9   240.5   241.9
 2         18.5    19.0    19.2    19.2    19.3    19.3    19.4    19.4    19.4    19.4
 3         10.1     9.6     9.3     9.1     9.0     8.9     8.9     8.8     8.8     8.8
 4          7.7     6.9     6.6     6.4     6.3     6.2     6.1     6.0     6.0     6.0
 5          6.6     5.8     5.4     5.2     5.1     5.0     4.9     4.8     4.8     4.7
 6          6.0     5.1     4.8     4.5     4.4     4.3     4.2     4.1     4.1     4.1
 7          5.6     4.7     4.3     4.1     4.0     3.9     3.8     3.7     3.7     3.6
 8          5.3     4.5     4.1     3.8     3.7     3.6     3.5     3.4     3.4     3.3
 9          5.1     4.3     3.9     3.6     3.5     3.4     3.3     3.2     3.2     3.1
10          5.0     4.1     3.7     3.5     3.3     3.2     3.1     3.1     3.0     3.0
11          4.8     4.0     3.6     3.4     3.2     3.1     3.1     2.9     2.9     2.9
12          4.7     3.9     3.5     3.3     3.1     3.0     2.9     2.8     2.8     2.8


Using an F distribution table is similar to using a chi-squared 
distribution table, only this time the column headings across the 
top give the degrees of freedom for one sample and the row labels 
give the degrees of freedom for the other sample. A separate table 
is used for each common p-value. 

In Table 1-7, when $v_1 = 1$ and $v_2 = 12$, the critical value is 4.7. This means that when we perform a statistical test, we calculate our test statistic and compare it to the critical value of 4.7 from this table; if our calculated test statistic is greater than 4.7, our result is considered statistically significant. In this table, for any test statistic greater than the number in the table, the p-value is less than .05. This means that when $v_1 = 1$ and $v_2 = 12$, the probability of an F statistic of 4.7 or higher occurring when your null hypothesis is true is 5%, so there’s only a 5% chance of rejecting the null hypothesis when it is actually true.
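
The meaning of the critical value can be illustrated by simulation. The sketch below builds F values as a ratio of two chi-squared draws, each divided by its degrees of freedom, and counts how often they exceed 4.7. The helper names, the seed, and the number of draws are assumptions for the illustration, and because the table entry is rounded to one decimal place, the result is only expected to be close to .05.

```python
import random

def chi_squared_draw(df, rng):
    """Sum of squares of df standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(df))

def f_draw(v1, v2, rng):
    """One F value: ratio of two chi-squared draws, each divided by its degrees of freedom."""
    return (chi_squared_draw(v1, rng) / v1) / (chi_squared_draw(v2, rng) / v2)

rng = random.Random(2)  # fixed seed so the run is repeatable
n_draws = 50_000
critical_value = 4.7    # table entry for v1 = 1, v2 = 12 at p = .05

exceed = sum(1 for _ in range(n_draws) if f_draw(1, 12, rng) > critical_value)
tail_fraction = exceed / n_draws

print(round(tail_fraction, 3))
```

About 5% of the simulated F values exceed the critical value, which matches the 5% chance of rejecting a true null hypothesis described above.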
Let’s look at another example. Table 1-8 shows the F distribution table when p = .01.

TABLE 1-8: F PROBABILITY DISTRIBUTION TABLE FOR p = .01

v2 \ v1        1        2        3        4        5        6        7        8        9       10
 1        4052.2   4999.3   5403.5   5624.3   5764.0   5859.0   5928.3   5981.0   6022.4   6055.9
 2          98.5     99.0     99.2     99.3     99.3     99.3     99.4     99.4     99.4     99.4
 3          34.1     30.8     29.5     28.7     28.2     27.9     27.7     27.5     27.3     27.2
 4          21.2     18.0     16.7     16.0     15.5     15.2     15.0     14.8     14.7     14.5
 5          16.3     13.3     12.1     11.4     11.0     10.7     10.5     10.3     10.2     10.1
 6          13.7     10.9      9.8      9.1      8.7      8.5      8.3      8.1      8.0      7.9
 7          12.2      9.5      8.5      7.8      7.5      7.2      7.0      6.8      6.7      6.6
 8          11.3      8.6      7.6      7.0      6.6      6.4      6.2      6.0      5.9      5.8
 9          10.6      8.0      7.0      6.4      6.1      5.8      5.6      5.5      5.4      5.3
10          10.0      7.6      6.6      6.0      5.6      5.4      5.2      5.1      4.9      4.8
11           9.6      7.2      6.2      5.7      5.3      5.1      4.9      4.7      4.6      4.5
12           9.3      6.9      6.0      5.4      5.1      4.8      4.6      4.5      4.4      4.3


Now when $v_1 = 1$ and $v_2 = 12$, the critical value is 9.3. The probability that a sample statistic as large or larger than 9.3 would occur if your null hypothesis is true is only .01. Thus, there is a very small probability that you would incorrectly reject the null hypothesis. Notice that when p = .01, the critical value is larger than when p = .05. For constant $v_1$ and $v_2$, as the p-value goes down, the critical value goes up.


SIMPLE REGRESSION ANALYSIS

FIRST STEPS

ALL RIGHT THEN, LET'S GO!

THIS TABLE SHOWS THE HIGH TEMPERATURE AND THE NUMBER OF ICED TEA ORDERS EVERY DAY FOR TWO WEEKS.


                 High temp. (°C)    Iced tea orders
22nd (Mon.)      29                 77
23rd (Tues.)     28                 62
24th (Wed.)      34                 93
25th (Thurs.)    31                 84
26th (Fri.)      25                 59
27th (Sat.)      29                 64
28th (Sun.)      32                 80
29th (Mon.)      31                 75
30th (Tues.)     24                 58
31st (Wed.)      33                 91
1st (Thurs.)     25                 51
2nd (Fri.)       31                 73
3rd (Sat.)       26                 65
4th (Sun.)       30                 84



PLOTTING THE DATA

[Scatter plot: iced tea orders versus high temp. (°C)]

...LIKE THIS.

I SEE.

SEE HOW THE DOTS ROUGHLY LINE UP? THAT SUGGESTS THESE VARIABLES ARE CORRELATED. THE CORRELATION COEFFICIENT, CALLED R, INDICATES HOW STRONG THE CORRELATION IS.

R = 0.9069

R RANGES FROM +1 TO -1, AND THE FURTHER IT IS FROM ZERO, THE STRONGER THE CORRELATION.* I'LL SHOW YOU HOW TO WORK OUT THE CORRELATION COEFFICIENT ON PAGE 78.

* A POSITIVE R VALUE INDICATES A POSITIVE RELATIONSHIP, MEANING AS x INCREASES, SO DOES y. A NEGATIVE R VALUE MEANS AS THE x VALUE INCREASES, THE y VALUE DECREASES.















































































HERE, R IS LARGE, INDICATING ICED TEA REALLY DOES SELL BETTER ON HOTTER DAYS.
THE REGRESSION EQUATION

BASICALLY, THE GOAL OF REGRESSION ANALYSIS IS...

...TO OBTAIN THE REGRESSION EQUATION...

...IN THE FORM OF y = ax + b.



[Two scatter plots of iced tea orders versus high temp. (°C); the second shows the regression line]

IF YOU INPUT A HIGH TEMPERATURE FOR x...

[Close-up of the regression line y = ax + b]

...YOU CAN PREDICT HOW MANY ORDERS OF ICED TEA THERE WILL BE (y).
I SEE! REGRESSION ANALYSIS DOESN'T SEEM TOO HARD.

AS I SAID EARLIER, y IS THE DEPENDENT (OR OUTCOME) VARIABLE AND x IS THE INDEPENDENT (OR PREDICTOR) VARIABLE.

y = ax + b
(y: DEPENDENT VARIABLE; x: INDEPENDENT VARIABLE)

a IS THE REGRESSION COEFFICIENT, WHICH TELLS US THE SLOPE OF THE LINE WE MAKE.

THAT LEAVES US WITH b, THE INTERCEPT. THIS TELLS US WHERE OUR LINE CROSSES THE y-AXIS.

FINDING THE EQUATION IS ONLY PART OF THE STORY.

YOU ALSO NEED TO LEARN HOW TO VERIFY THE ACCURACY OF YOUR EQUATION BY TESTING FOR CERTAIN CIRCUMSTANCES. LET'S LOOK AT THE PROCESS AS A WHOLE.
GENERAL REGRESSION ANALYSIS PROCEDURE

HERE'S AN OVERVIEW OF REGRESSION ANALYSIS.

STEP 1: DRAW A SCATTER PLOT OF THE INDEPENDENT VARIABLE VERSUS THE DEPENDENT VARIABLE. IF THE DOTS LINE UP, THE VARIABLES MAY BE CORRELATED.

STEP 2: CALCULATE THE REGRESSION EQUATION.

STEP 3: CALCULATE THE CORRELATION COEFFICIENT (R) AND ASSESS OUR POPULATION AND ASSUMPTIONS.

WHAT'S R?

STEP 5: CALCULATE THE CONFIDENCE INTERVALS.

STEP 6: MAKE A PREDICTION!

REGRESSION DIAGNOSTICS


























IT'S EASIER TO EXPLAIN WITH AN EXAMPLE. LET'S USE SALES DATA FROM NORNS.

STEP 1: DRAW A SCATTER PLOT OF THE INDEPENDENT VARIABLE VERSUS THE DEPENDENT VARIABLE. IF THE DOTS LINE UP, THE VARIABLES MAY BE CORRELATED.

FIRST, DRAW A SCATTER PLOT OF THE INDEPENDENT VARIABLE AND THE DEPENDENT VARIABLE.

[Scatter plot; x-axis: high temp. (°C)]

WE'VE DONE THAT ALREADY.
[Scatter plot: iced tea orders versus high temp. (°C)]

WHEN WE PLOT EACH DAY'S HIGH TEMPERATURE AGAINST ICED TEA ORDERS, THEY SEEM TO LINE UP.

AND WE KNOW FROM EARLIER THAT THE VALUE OF R IS 0.9069, WHICH IS PRETTY HIGH.

IT LOOKS LIKE THESE VARIABLES ARE CORRELATED.

DO YOU REALLY LEARN ANYTHING FROM ALL THOSE DOTS? WHY NOT JUST CALCULATE R?

THE SHAPE OF OUR DATA IS IMPORTANT!

[Scatter plot with randomly scattered dots; x-axis: high temp. (°C)]

LOOK AT THIS CHART. RATHER THAN FLOWING IN A LINE, THE DOTS ARE SCATTERED RANDOMLY.

YOU CAN STILL FIND A REGRESSION EQUATION, BUT IT'S MEANINGLESS. THE LOW R VALUE CONFIRMS IT, BUT THE SCATTER PLOT LETS YOU SEE IT WITH YOUR OWN EYES.

ALWAYS DRAW A PLOT FIRST TO GET A SENSE OF THE DATA'S SHAPE.

OH, I SEE. PLOTS...ARE...IMPORTANT!
STEP 2: CALCULATE THE REGRESSION EQUATION.

y = ax + b

NOW, LET'S MAKE A REGRESSION EQUATION!

LET'S DRAW A STRAIGHT LINE, FOLLOWING THE PATTERN IN THE DATA AS BEST WE CAN.

[Scatter plot with a fitted line; x-axis: high temp. (°C)]

THE LITTLE ARROWS ARE THE DISTANCES BETWEEN THE LINE, WHICH REPRESENTS THE ESTIMATED VALUES, AND EACH DOT, WHICH REPRESENTS AN ACTUAL MEASURED VALUE. THE DISTANCES ARE CALLED RESIDUALS. THE GOAL IS TO FIND THE LINE THAT BEST MINIMIZES ALL THE RESIDUALS.

THIS IS CALLED LINEAR LEAST SQUARES REGRESSION.

WE SQUARE THE RESIDUALS TO FIND THE SUM OF SQUARES, WHICH WE USE TO FIND THE REGRESSION EQUATION.



Step 1: Calculate $S_{xx}$ (sum of squares of x), $S_{yy}$ (sum of squares of y), and $S_{xy}$ (sum of products of x and y).

Step 2: Calculate $S_e$ (residual sum of squares).

Step 3: Differentiate $S_e$ with respect to a and b, and set it equal to 0.

Step 4: Separate out a and b.

Step 5: Isolate the a component.

Step 6: Find the regression equation.

I'LL ADD THIS TO MY NOTES.
Step 1: Find the following, each summed over all 14 days.

• The sum of squares of x, $S_{xx}$: $(x - \bar{x})^2$

• The sum of squares of y, $S_{yy}$: $(y - \bar{y})^2$

• The sum of products of x and y, $S_{xy}$: $(x - \bar{x})(y - \bar{y})$

Note: The bar over a variable (like $\bar{x}$) is a notation that means average. We can call this variable x-bar.



                 High temp.   Iced tea
                 in °C        orders
                 x            y          x - x̄    y - ȳ    (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
22nd (Mon.)      29           77         -0.1      4.4      0.0         19.6       -0.6
23rd (Tues.)     28           62         -1.1    -10.6      1.3        111.8       12.1
24th (Wed.)      34           93          4.9     20.4     23.6        417.3       99.2
25th (Thurs.)    31           84          1.9     11.4      3.4        130.6       21.2
26th (Fri.)      25           59         -4.1    -13.6     17.2        184.2       56.2
27th (Sat.)      29           64         -0.1     -8.6      0.0         73.5        1.2
28th (Sun.)      32           80          2.9      7.4      8.2         55.2       21.2
29th (Mon.)      31           75          1.9      2.4      3.4          5.9        4.5
30th (Tues.)     24           58         -5.1    -14.6     26.4        212.3       74.9
31st (Wed.)      33           91          3.9     18.4     14.9        339.6       71.1
1st (Thurs.)     25           51         -4.1    -21.6     17.2        465.3       89.4
2nd (Fri.)       31           73          1.9      0.4      3.4          0.2        0.8
3rd (Sat.)       26           65         -3.1     -7.6      9.9         57.3       23.8
4th (Sun.)       30           84          0.9     11.4      0.7        130.6        9.8
Sum              408          1016        0        0      129.7       2203.4      484.9
Average          29.1         72.6
                 x̄            ȳ                           S_xx        S_yy        S_xy

* SOME OF THE FIGURES IN THIS CHAPTER ARE ROUNDED FOR THE SAKE OF PRINTING, BUT CALCULATIONS ARE DONE USING THE FULL, UNROUNDED VALUES RESULTING FROM THE RAW DATA UNLESS OTHERWISE STATED.
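
The sums in this table can be reproduced in a few lines of Python. This is just a sketch of the Step 1 arithmetic; the variable names (`temps`, `orders`, `s_xx`, and so on) are illustrative choices.

```python
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]    # x: high temp. in °C
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]   # y: iced tea orders

x_bar = sum(temps) / len(temps)     # average high temperature
y_bar = sum(orders) / len(orders)   # average number of orders

s_xx = sum((x - x_bar) ** 2 for x in temps)                            # sum of squares of x
s_yy = sum((y - y_bar) ** 2 for y in orders)                           # sum of squares of y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))   # sum of products

print(f"x_bar={x_bar:.1f}, y_bar={y_bar:.1f}")
print(f"S_xx={s_xx:.1f}, S_yy={s_yy:.1f}, S_xy={s_xy:.1f}")
```

The code keeps the unrounded averages throughout, which is why its totals agree with the table even where the rounded row entries would not add up exactly.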
Step 2: Find the residual sum of squares, $S_e$.

• $y$ is the observed value.

• $\hat{y}$ is the estimated value based on our regression equation.

• $y - \hat{y}$ is called the residual and is written as e.

Note: The caret in $\hat{y}$ is affectionately called a hat, so we call this parameter estimate y-hat.


                 High temp.   Actual iced   Predicted iced   Residuals (e)           Squared residuals
                 in °C        tea orders    tea orders
                 x            y             ŷ = ax + b       y - ŷ                   (y - ŷ)²
22nd (Mon.)      29           77            a × 29 + b       77 - (a × 29 + b)       [77 - (a × 29 + b)]²
23rd (Tues.)     28           62            a × 28 + b       62 - (a × 28 + b)       [62 - (a × 28 + b)]²
24th (Wed.)      34           93            a × 34 + b       93 - (a × 34 + b)       [93 - (a × 34 + b)]²
25th (Thurs.)    31           84            a × 31 + b       84 - (a × 31 + b)       [84 - (a × 31 + b)]²
26th (Fri.)      25           59            a × 25 + b       59 - (a × 25 + b)       [59 - (a × 25 + b)]²
27th (Sat.)      29           64            a × 29 + b       64 - (a × 29 + b)       [64 - (a × 29 + b)]²
28th (Sun.)      32           80            a × 32 + b       80 - (a × 32 + b)       [80 - (a × 32 + b)]²
29th (Mon.)      31           75            a × 31 + b       75 - (a × 31 + b)       [75 - (a × 31 + b)]²
30th (Tues.)     24           58            a × 24 + b       58 - (a × 24 + b)       [58 - (a × 24 + b)]²
31st (Wed.)      33           91            a × 33 + b       91 - (a × 33 + b)       [91 - (a × 33 + b)]²
1st (Thurs.)     25           51            a × 25 + b       51 - (a × 25 + b)       [51 - (a × 25 + b)]²
2nd (Fri.)       31           73            a × 31 + b       73 - (a × 31 + b)       [73 - (a × 31 + b)]²
3rd (Sat.)       26           65            a × 26 + b       65 - (a × 26 + b)       [65 - (a × 26 + b)]²
4th (Sun.)       30           84            a × 30 + b       84 - (a × 30 + b)       [84 - (a × 30 + b)]²
Sum              408          1016          408a + 14b       1016 - (408a + 14b)     S_e
Average          29.1         72.6          29.1a + b        72.6 - (29.1a + b)      S_e / 14
                 x̄            ȳ             = x̄a + b         = ȳ - (x̄a + b)

$$S_e = [77 - (a \times 29 + b)]^2 + \cdots + [84 - (a \times 30 + b)]^2$$

THE SUM OF THE RESIDUALS SQUARED IS CALLED THE RESIDUAL SUM OF SQUARES. IT IS WRITTEN AS S_e OR RSS.
Step 3: Differentiate $S_e$ with respect to a and b, and set it equal to 0.

When differentiating $y = (ax + b)^n$ with respect to x, the result is

$$\frac{dy}{dx} = n(ax + b)^{n-1} \times a$$

• Differentiate with respect to a.

$$\frac{dS_e}{da} = 2[77 - (29a + b)] \times (-29) + \cdots + 2[84 - (30a + b)] \times (-30) = 0 \qquad \text{①}$$

• Differentiate with respect to b.

$$\frac{dS_e}{db} = 2[77 - (29a + b)] \times (-1) + \cdots + 2[84 - (30a + b)] \times (-1) = 0 \qquad \text{②}$$


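Step 4: Rearrange ① and ② from the previous step.

Rearrange ①.

$$\begin{aligned}
2[77 - (29a + b)] \times (-29) + \cdots + 2[84 - (30a + b)] \times (-30) &= 0 \\
[77 - (29a + b)] \times (-29) + \cdots + [84 - (30a + b)] \times (-30) &= 0 && \text{Divide both sides by 2.} \\
29[(29a + b) - 77] + \cdots + 30[(30a + b) - 84] &= 0 && \text{Multiply by } -1. \\
(29 \times 29a + 29 \times b - 29 \times 77) + \cdots + (30 \times 30a + 30 \times b - 30 \times 84) &= 0 && \text{Multiply.} \\
(29^2 + \cdots + 30^2)a + (29 + \cdots + 30)b - (29 \times 77 + \cdots + 30 \times 84) &= 0 && \text{③ Separate out } a \text{ and } b.
\end{aligned}$$

Rearrange ②.

$$\begin{aligned}
2[77 - (29a + b)] \times (-1) + \cdots + 2[84 - (30a + b)] \times (-1) &= 0 \\
[77 - (29a + b)] \times (-1) + \cdots + [84 - (30a + b)] \times (-1) &= 0 && \text{Divide both sides by 2.} \\
[(29a + b) - 77] + \cdots + [(30a + b) - 84] &= 0 && \text{Multiply by } -1. \\
(29 + \cdots + 30)a + \underbrace{b + \cdots + b}_{14} - (77 + \cdots + 84) &= 0 && \text{Separate out } a \text{ and } b. \\
b &= \frac{77 + \cdots + 84}{14} - \frac{29 + \cdots + 30}{14}a && \text{④ Isolate } b \text{ on the left side of the equation.} \\
b &= \bar{y} - \bar{x}a && \text{⑤ The components in ④ are the averages of } y \text{ and } x.
\end{aligned}$$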
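Step 5: Plug the value of b found in ④ into line ③ (③ and ④ are the results from Step 4).

$$\begin{aligned}
(29^2 + \cdots + 30^2)a + (29 + \cdots + 30)\left[\frac{77 + \cdots + 84}{14} - \frac{29 + \cdots + 30}{14}a\right] - (29 \times 77 + \cdots + 30 \times 84) &= 0 && \text{Now } a \text{ is the only variable.} \\
(29^2 + \cdots + 30^2)a + \frac{(29 + \cdots + 30)(77 + \cdots + 84)}{14} - \frac{(29 + \cdots + 30)^2}{14}a - (29 \times 77 + \cdots + 30 \times 84) &= 0 \\
\left[(29^2 + \cdots + 30^2) - \frac{(29 + \cdots + 30)^2}{14}\right]a + \frac{(29 + \cdots + 30)(77 + \cdots + 84)}{14} - (29 \times 77 + \cdots + 30 \times 84) &= 0 && \text{Combine the } a \text{ terms.} \\
\left[(29^2 + \cdots + 30^2) - \frac{(29 + \cdots + 30)^2}{14}\right]a &= (29 \times 77 + \cdots + 30 \times 84) - \frac{(29 + \cdots + 30)(77 + \cdots + 84)}{14} && \text{Transpose.}
\end{aligned}$$

Rearrange the left side of the equation.

$$\begin{aligned}
&(29^2 + \cdots + 30^2) - \frac{(29 + \cdots + 30)^2}{14} \\
&= (29^2 + \cdots + 30^2) - 2 \times \frac{(29 + \cdots + 30)^2}{14} + \frac{(29 + \cdots + 30)^2}{14} && \text{We add and subtract } \tfrac{(29 + \cdots + 30)^2}{14}. \\
&= (29^2 + \cdots + 30^2) - 2 \times (29 + \cdots + 30) \times \frac{29 + \cdots + 30}{14} + \left(\frac{29 + \cdots + 30}{14}\right)^2 \times 14 && \text{The last term is multiplied by } \tfrac{14}{14}. \\
&= (29^2 + \cdots + 30^2) - 2 \times (29 + \cdots + 30) \times \bar{x} + (\bar{x})^2 \times 14 && \bar{x} = \tfrac{29 + \cdots + 30}{14} \\
&= (29^2 + \cdots + 30^2) - 2 \times (29 + \cdots + 30) \times \bar{x} + \underbrace{(\bar{x})^2 + \cdots + (\bar{x})^2}_{14} \\
&= \left[29^2 - 2 \times 29 \times \bar{x} + (\bar{x})^2\right] + \cdots + \left[30^2 - 2 \times 30 \times \bar{x} + (\bar{x})^2\right] \\
&= (29 - \bar{x})^2 + \cdots + (30 - \bar{x})^2 \\
&= S_{xx}
\end{aligned}$$

Rearrange the right side of the equation.

$$\begin{aligned}
&(29 \times 77 + \cdots + 30 \times 84) - \frac{(29 + \cdots + 30)(77 + \cdots + 84)}{14} \\
&= (29 \times 77 + \cdots + 30 \times 84) - \frac{29 + \cdots + 30}{14} \times \frac{77 + \cdots + 84}{14} \times 14 \\
&= (29 \times 77 + \cdots + 30 \times 84) - \bar{x} \times \bar{y} \times 14 \\
&= (29 \times 77 + \cdots + 30 \times 84) - \bar{x} \times \bar{y} \times 14 - \bar{x} \times \bar{y} \times 14 + \bar{x} \times \bar{y} \times 14 && \text{We add and subtract } \bar{x} \times \bar{y} \times 14. \\
&= (29 \times 77 + \cdots + 30 \times 84) - \frac{29 + \cdots + 30}{14} \times \bar{y} \times 14 - \bar{x} \times \frac{77 + \cdots + 84}{14} \times 14 + \bar{x} \times \bar{y} \times 14 \\
&= (29 \times 77 + \cdots + 30 \times 84) - (29 + \cdots + 30)\bar{y} - \bar{x}(77 + \cdots + 84) + \bar{x} \times \bar{y} \times 14 \\
&= (29 \times 77 + \cdots + 30 \times 84) - (29 + \cdots + 30)\bar{y} - (77 + \cdots + 84)\bar{x} + \underbrace{\bar{x} \times \bar{y} + \cdots + \bar{x} \times \bar{y}}_{14} \\
&= (29 - \bar{x})(77 - \bar{y}) + \cdots + (30 - \bar{x})(84 - \bar{y}) \\
&= S_{xy}
\end{aligned}$$

$$S_{xx}a = S_{xy}$$

$$a = \frac{S_{xy}}{S_{xx}} \qquad \text{⑥ Isolate } a \text{ on the left side of the equation.}$$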
Step 6: Calculate the regression equation.

From ⑥ in Step 5, $a = \frac{S_{xy}}{S_{xx}}$. From ⑤ in Step 4, $b = \bar{y} - \bar{x}a$.

If we plug in the values we calculated in Step 1,

$$a = \frac{S_{xy}}{S_{xx}} = \frac{484.9}{129.7} = 3.7$$

$$b = \bar{y} - \bar{x}a = 72.6 - 29.1 \times 3.7 = -36.4$$

then the regression equation is

$$y = 3.7x - 36.4.$$

It’s that simple!

Note: The values shown are rounded for the sake of printing, but the result (-36.4) was calculated using the full, unrounded values.
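
As a check on the steps above, the slope and intercept can be computed directly from the raw data. The following is a minimal sketch; the `predict` helper and the variable names are illustrative, not part of the book's procedure.

```python
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]    # x: high temp. in °C
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]   # y: iced tea orders

x_bar = sum(temps) / len(temps)
y_bar = sum(orders) / len(orders)
s_xx = sum((x - x_bar) ** 2 for x in temps)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))

a = s_xy / s_xx          # slope (regression coefficient)
b = y_bar - x_bar * a    # intercept

def predict(high_temp):
    """Estimated iced tea orders for a given high temperature."""
    return a * high_temp + b

print(f"a={a:.1f}, b={b:.1f}")
print(f"predicted orders at 29 °C: {predict(29):.1f}")
```

Using the unrounded slope gives an intercept that rounds to -36.4, whereas plugging the rounded 3.7 and 29.1 into the formula by hand would give a slightly different number. That is the point of the note about rounding.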



[Scatter plot with the regression line; x-axis: high temp. (°C)]

WE DID IT! WE ACTUALLY DID IT!

THE RELATIONSHIP BETWEEN THE RESIDUALS AND THE SLOPE a AND INTERCEPT b IS ALWAYS

$$a = \frac{\text{sum of products of } x \text{ and } y}{\text{sum of squares of } x} = \frac{S_{xy}}{S_{xx}}$$

$$b = \bar{y} - \bar{x}a$$

THIS IS TRUE FOR ANY LINEAR REGRESSION.
SO, MIU, WHAT ARE THE AVERAGE VALUES FOR THE HIGH TEMPERATURE AND THE ICED TEA ORDERS?

REMEMBER, THE AVERAGE TEMPERATURE IS $\bar{x}$ AND THE AVERAGE NUMBER OF ORDERS IS $\bar{y}$. NOW FOR A LITTLE MAGIC.

[Scatter plot with the regression line; x-axis: high temp. (°C)]

WITHOUT LOOKING, I CAN TELL YOU THAT THE REGRESSION EQUATION CROSSES THE POINT (29.1, 72.6).

THE REGRESSION EQUATION CAN BE...

$$\begin{aligned}
y &= ax + b \\
  &= ax + (\bar{y} - \bar{x}a) && \text{THAT'S FROM STEP 4.} \\
  &= a(x - \bar{x}) + \bar{y}
\end{aligned}$$

...REARRANGED LIKE THIS.

I SEE!

NOW, IF WE SET x TO THE AVERAGE VALUE ($\bar{x}$) WE FOUND BEFORE...

$$y = a(\bar{x} - \bar{x}) + \bar{y} = \bar{y}$$

SEE WHAT HAPPENS?

WHEN x IS THE AVERAGE, SO IS y!
STEP 3: CALCULATE THE CORRELATION COEFFICIENT (R) AND ASSESS OUR POPULATION AND ASSUMPTIONS.

NEXT WE'LL DETERMINE THE ACCURACY OF THE REGRESSION EQUATION WE HAVE COME UP WITH.

[Two plots of iced tea orders, each with a regression line]

THE DOTS ARE CLOSER TO THE REGRESSION LINE IN THE LEFT GRAPH.

RIGHT!
WHEN A REGRESSION EQUATION IS ACCURATE, THE ESTIMATED VALUES (THE LINE) ARE CLOSER TO THE OBSERVED VALUES (DOTS).

RIGHT. ACCURACY IS IMPORTANT, BUT DETERMINING IT BY LOOKING AT A GRAPH IS PRETTY SUBJECTIVE.

THE DOTS ARE CLOSE.

THE DOTS ARE KIND OF FAR.

YES, THAT'S TRUE.

RIGHT! WE USE R TO REPRESENT AN INDEX THAT MEASURES THE ACCURACY OF A REGRESSION EQUATION. THE INDEX COMPARES OUR DATA TO OUR PREDICTIONS. IN OTHER WORDS, THE MEASURED x AND y TO THE ESTIMATED x AND $\hat{y}$.

R IS ALSO CALLED THE PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT IN HONOR OF MATHEMATICIAN KARL PEARSON.
HERE'S THE EQUATION. WE CALCULATE THESE LIKE WE DID $S_{xx}$ AND $S_{xy}$ BEFORE.

$$R = \frac{\text{sum of products of } y \text{ and } \hat{y}}{\sqrt{\text{sum of squares of } y \times \text{sum of squares of } \hat{y}}} = \frac{S_{y\hat{y}}}{\sqrt{S_{yy} \times S_{\hat{y}\hat{y}}}} = \frac{1812.3}{\sqrt{2203.4 \times 1812.3}} = 0.9069$$

THAT'S NOT TOO BAD!

THIS LOOKS FAMILIAR.

REGRESSION FUNCTION!



                 Actual    Estimated values
                 values    ŷ = 3.7x - 36.4
                 y         ŷ                   y - ȳ    ŷ - avg(ŷ)   (y - ȳ)²   (ŷ - avg(ŷ))²   (y - ȳ)(ŷ - avg(ŷ))   (y - ŷ)²
22nd (Mon.)      77        72.0                 4.4      -0.5          19.6         0.3             -2.4                  24.6
23rd (Tues.)     62        68.3               -10.6      -4.3         111.8        18.2             45.2                  39.7
24th (Wed.)      93        90.7                20.4      18.2         417.3       329.6            370.9                   5.2
25th (Thurs.)    84        79.5                11.4       6.9         130.6        48.2             79.3                  20.1
26th (Fri.)      59        57.1               -13.6     -15.5         184.2       239.8            210.2                   3.7
27th (Sat.)      64        72.0                -8.6      -0.5          73.5         0.3              4.6                  64.6
28th (Sun.)      80        83.3                 7.4      10.7          55.2       114.1             79.3                  10.6
29th (Mon.)      75        79.5                 2.4       6.9           5.9        48.2             16.9                  20.4
30th (Tues.)     58        53.3               -14.6     -19.2         212.3       369.5            280.1                  21.6
31st (Wed.)      91        87.0                18.4      14.4         339.6       207.9            265.7                  16.1
1st (Thurs.)     51        57.1               -21.6     -15.5         465.3       239.8            334.0                  37.0
2nd (Fri.)       73        79.5                 0.4       6.9           0.2        48.2              3.0                  42.4
3rd (Sat.)       65        60.8                -7.6     -11.7          57.3       138.0             88.9                  17.4
4th (Sun.)       84        75.8                11.4       3.2         130.6        10.3             36.6                  67.6
Sum              1016      1016                 0         0          2203.4      1812.3           1812.3                 391.1
Average          72.6      72.6
                 ȳ         avg(ŷ)                                     S_yy        S_ŷŷ             S_yŷ                   S_e

S_e ISN'T NECESSARY FOR CALCULATING R, BUT I INCLUDED IT BECAUSE WE'LL NEED IT LATER.
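
The column totals and the value of R can be reproduced from the raw data. This sketch refits the line with unrounded coefficients before computing the sums; the names `predicted`, `s_pp` (sum of squares of the estimated values), and `s_yp` (sum of products of the actual and estimated values) are made up for the example.

```python
import math

temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]    # x: high temp. in °C
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]   # y: iced tea orders

x_bar = sum(temps) / len(temps)
y_bar = sum(orders) / len(orders)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))
     / sum((x - x_bar) ** 2 for x in temps))
b = y_bar - x_bar * a

predicted = [a * x + b for x in temps]     # estimated values (y-hat)
p_bar = sum(predicted) / len(predicted)    # average of the estimated values

s_yy = sum((y - y_bar) ** 2 for y in orders)
s_pp = sum((p - p_bar) ** 2 for p in predicted)
s_yp = sum((y - y_bar) * (p - p_bar) for y, p in zip(orders, predicted))
s_e = sum((y - p) ** 2 for y, p in zip(orders, predicted))   # residual sum of squares

r = s_yp / math.sqrt(s_yy * s_pp)

print(f"S_yy={s_yy:.1f}, S_pp={s_pp:.1f}, S_yp={s_yp:.1f}, S_e={s_e:.1f}, R={r:.4f}")
```

Note that the average of the estimated values equals the average of the actual values, which is why both Sum entries in the table are 1016.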
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IF WE SQUARE R, 
IT'S CALLED THE
COEFFICIENT OF
DETERMINATION AND
IS WRITTEN AS R².

R² CAN BE AN
INDICATOR OF...

I AM A CORRELATION
COEFFICIENT.

I AM A
COEFFICIENT OF
DETERMINATION.

...HOW MUCH
VARIANCE IS
EXPLAINED BY
OUR REGRESSION
EQUATION.



AN R² OF ZERO INDICATES THAT
THE OUTCOME VARIABLE CAN'T
BE RELIABLY PREDICTED FROM
THE PREDICTOR VARIABLE.

THE HIGHER THE
ACCURACY OF THE
REGRESSION EQUATION,
THE CLOSER THE R² VALUE
IS TO 1, AND VICE VERSA.

SO HOW HIGH DOES R²
NEED TO BE FOR THE
REGRESSION EQUATION
TO BE CONSIDERED
ACCURATE?



UNFORTUNATELY, THERE
IS NO UNIVERSAL
STANDARD IN STATISTICS.

BUT GENERALLY WE
WANT A VALUE OF AT
LEAST .5.

LOWEST... .5...

NOW TRY FINDING THE
VALUE OF R².

R² = (0.9069)² = 0.8225

IT'S .8225.

THE VALUE OF R² FOR OUR
REGRESSION EQUATION
IS WELL OVER .5, SO OUR
EQUATION SHOULD BE
ABLE TO ESTIMATE ICED
TEA ORDERS RELATIVELY
ACCURATELY.

[Figure: scatter plot of iced tea orders against high temperature (°C), with the regression line y = 3.7x - 36.4.]


R² = (correlation coefficient)²

JOT THIS EQUATION
DOWN. R² CAN BE
CALCULATED DIRECTLY
FROM THESE VALUES.
USING OUR NORNS DATA,
1 - (391.1 / 2203.4) = .8225!
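As a quick check, the two shortcut routes to R² (the slope times Sxy over Syy, and one minus Se over Syy) can be computed side by side. This is a minimal sketch using the Norns data from this chapter; the variable names are chosen only for this example.

```python
# Norns data from this chapter.
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(orders) / n

s_xx = sum((x - x_bar) ** 2 for x in temps)
s_yy = sum((y - y_bar) ** 2 for y in orders)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))

a = s_xy / s_xx          # slope of the sample regression
b = y_bar - a * x_bar    # intercept of the sample regression

# Residual sum of squares: squared gaps between measured and estimated orders.
s_e = sum((y - (a * x + b)) ** 2 for x, y in zip(temps, orders))

r2_from_slope = a * s_xy / s_yy
r2_from_residuals = 1 - s_e / s_yy

print(f"Se = {s_e:.1f}")
print(f"R^2 = {r2_from_slope:.4f} (slope route), {r2_from_residuals:.4f} (residual route)")
```

Both routes agree because Syy splits exactly into the part explained by the line (a × Sxy) and the leftover Se.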


SAMPLES ANP POPULATIONS 


$$R^2 = \frac{a \times S_{xy}}{S_{yy}} = 1 - \frac{S_e}{S_{yy}}$$

THAT'S
HANDY!



WE'VE FINISHED THE
FIRST THREE STEPS.

HOORAY!

I MEANT TO ASK
YOU ABOUT THAT.
WHAT POPULATION?
JAPAN? EARTH?

ACTUALLY, THE
POPULATION
WE'RE TALKING
ABOUT ISN'T
PEOPLE.
IT'S DATA.

| Date | High temp. (°C) | Iced tea orders |
|---|---|---|
| 22nd (Mon.) | 29 | 77 |
| 23rd (Tues.) | 28 | 62 |
| 24th (Wed.) | 34 | 93 |
| 25th (Thurs.) | 31 | 84 |
| 26th (Fri.) | 25 | 59 |
| 27th (Sat.) | 29 | 64 |
| 28th (Sun.) | 32 | 80 |
| 29th (Mon.) | 31 | 75 |
| 30th (Tues.) | 24 | 58 |
| 31st (Wed.) | 33 | 91 |
| 1st (Thurs.) | 25 | 51 |
| 2nd (Fri.) | 31 | 73 |
| 3rd (Sat.) | 26 | 65 |
| 4th (Sun.) | 30 | 84 |







...THESE THREE DAYS
ARE NOT THE ONLY
DAYS IN HISTORY
WITH A HIGH OF 31°C,
ARE THEY?

THERE MUST HAVE BEEN
MANY OTHERS IN THE
PAST, AND THERE WILL
BE MANY MORE IN THE
FUTURE, RIGHT?



THESE THREE
DAYS ARE A
SAMPLE...

[Figure: sampling. Population: all days with a high temperature of 31°C. Sample: the three observed days.]

FOR DAYS WITH THE
SAME NUMBER OF
ORDERS, THE DOTS
ARE STACKED.

...FROM THE POPULATION OF ALL
DAYS WITH A HIGH TEMPERATURE OF
31°C. WE USE SAMPLE DATA WHEN IT'S
UNLIKELY WE'LL BE ABLE TO GET THE
INFORMATION WE NEED FROM EVERY
SINGLE MEMBER OF THE POPULATION.

THAT MAKES
SENSE.


[Figure: each high temperature has its own population of days (panels include highs of 24°, 25°, 26°, 28°, 30°, 32°, 33°, and 34°C), and the observed days at each temperature are a sample drawn from that population.]

SAMPLES
REPRESENT THE
POPULATION.
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ASSUMPTIONS OF NORMALITY 


A REGRESSION
EQUATION IS
MEANINGFUL ONLY
IF A CERTAIN
HYPOTHESIS IS
VIABLE.

ALTERNATIVE HYPOTHESIS

THE NUMBER OF ORDERS OF ICED TEA ON
DAYS WITH TEMPERATURE x°C FOLLOWS A
NORMAL DISTRIBUTION WITH MEAN Ax+B AND
STANDARD DEVIATION σ (SIGMA).



LET'S TAKE IT SLOW. 
FIRST LOOK AT THE 
SHAPES ON THIS GRAPH. 


SAME
SHAPE

THESE SHAPES
REPRESENT THE ENTIRE
POPULATION OF ICED TEA
ORDERS FOR EACH HIGH
TEMPERATURE. SINCE WE
CAN'T POSSIBLY KNOW THE
EXACT DISTRIBUTION FOR EACH
TEMPERATURE, WE HAVE TO
ASSUME THAT THEY MUST ALL
BE THE SAME: A NORMAL,
BELL-SHAPED CURVE.


















"MUST ALL BE 
THE SAME"? 


WON'T THE
DISTRIBUTIONS
BE SLIGHTLY
DIFFERENT?

COULD THEY DIFFER
ACCORDING TO
TEMPERATURE?

GOOD
POINT.

THEY'RE NEVER
EXACTLY THE SAME.

BUT WE MUST ASSUME
THAT THEY ARE!
REGRESSION DEPENDS
ON THE ASSUMPTION
OF NORMALITY!

JUST BELIEVE IT,
OKAY?

I CAN
DO THAT.

BY THE WAY, Ax+B IS CALLED
THE POPULATION REGRESSION.
THE EXPRESSION ax+b IS THE
SAMPLE REGRESSION.

STEP 4: CONDUCT THE ANALYSIS OF VARIANCE.

A, LIKE a,
IS A SLOPE.
B, LIKE b, IS AN
INTERCEPT. AND σ
IS THE STANDARD
DEVIATION.

A, B, AND σ ARE
COEFFICIENTS
OF THE ENTIRE
POPULATION.

IF THE REGRESSION EQUATION IS y = ax + b:

• a SHOULD BE CLOSE TO A
• b SHOULD BE CLOSE TO B
• $\sqrt{\dfrac{S_e}{\text{number of individuals} - 2}}$ SHOULD BE CLOSE TO σ

DO YOU RECALL a, b,
AND THE STANDARD
DEVIATION FOR
OUR NORNS
DATA?

WELL, THE
REGRESSION
EQUATION WAS
y = 3.7x - 36.4,
SO...

• A IS ABOUT 3.7
• B IS ABOUT -36.4
• σ IS ABOUT $\sqrt{\dfrac{391.1}{14 - 2}} = \sqrt{\dfrac{391.1}{12}} \approx 5.7$

IS THAT RIGHT?

PERFECT!
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"CLOSE TO" SEEMS 
SO VAGUE. CAN'T WE
FIND A, B, AND σ WITH
MORE CERTAINTY?

SINCE A, B, AND σ ARE
COEFFICIENTS OF THE
POPULATION, WE'D
NEED TO USE ALL THE
NORNS ICED TEA AND
HIGH TEMPERATURE DATA
THROUGHOUT HISTORY! WE
COULD NEVER GET IT ALL.

...WE CAN DETERMINE
ONCE AND FOR ALL
WHETHER A = 0!

YOU SHOULD
LOOK MORE
EXCITED! THIS
IS IMPORTANT!

THAT WOULD
MAKE THIS
DREADED
HYPOTHESIS
TRUE!

[Figure: the null hypothesis (A = 0, a horizontal population regression) set against the case A ≠ 0 and the sample regression.]

IF THE SLOPE A = 0, THE LINE IS
HORIZONTAL. THAT MEANS ICED
TEA ORDERS ARE THE SAME,
NO MATTER WHAT THE HIGH
TEMPERATURE IS!

THE TEMPERATURE
DOESN'T MATTER!

HOW DO WE
FIND OUT
ABOUT A?

WE CAN DO AN
ANALYSIS OF
VARIANCE (ANOVA)!

LET'S DO THE
ANALYSIS AND SEE
WHAT FATE HAS IN
STORE FOR A.

THE STEPS OF ANOVA

Step 1: Define the population.
The population is “days with a high temperature of x degrees.”

Step 2: Set up a null hypothesis and an alternative hypothesis.
Null hypothesis is A = 0.
Alternative hypothesis is A ≠ 0.

Step 3: Select which hypothesis test to conduct.
We’ll use one-way analysis of variance.

Step 4: Choose the significance level.
We’ll use a significance level of .05.

Step 5: Calculate the test statistic from the sample data.
The test statistic is:

$$\frac{a^2}{\left(\dfrac{1}{S_{xx}}\right)} \div \frac{S_e}{\text{number of individuals} - 2}$$

Plug in the values from our sample regression equation:

$$\frac{3.7^2}{\left(\dfrac{1}{129.7}\right)} \div \frac{391.1}{14 - 2} = 55.6$$

The test statistic will follow an F distribution with first degree of freedom 1 and second degree of freedom 12 (number of individuals minus 2), if the null hypothesis is true.

Step 6: Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
At significance level .05, with d1 being 1 and d2 being 12, the critical value is 4.7472. Our test statistic is 55.6.

Step 7: Decide whether you can reject the null hypothesis.
Since our test statistic is greater than the critical value, we reject the null hypothesis.
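Steps 5 through 7 can be traced in code. The sketch below assumes the Norns data from this chapter and takes the critical value 4.7472 directly from the text rather than computing it from an F distribution; the names `F_CRITICAL_05`, `f_statistic`, and `reject_null` are illustrative only.

```python
# Norns data from this chapter.
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

# Critical value of F(1, 12) at significance level .05, quoted from the text.
F_CRITICAL_05 = 4.7472

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(orders) / n

s_xx = sum((x - x_bar) ** 2 for x in temps)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))

a = s_xy / s_xx
b = y_bar - a * x_bar
s_e = sum((y - (a * x + b)) ** 2 for x, y in zip(temps, orders))

# Test statistic: a^2 / (1 / Sxx), divided by Se / (n - 2).
f_statistic = a ** 2 / (1 / s_xx) / (s_e / (n - 2))

# Reject the null hypothesis A = 0 when the statistic exceeds the critical value.
reject_null = f_statistic > F_CRITICAL_05

print(f"F = {f_statistic:.1f}, reject A = 0: {reject_null}")
```

Note that plugging in the rounded slope 3.7 gives a slightly smaller number; the value 55.6 comes from the unrounded slope, which this sketch uses.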


THE F STATISTIC LETS US TEST
THE SLOPE OF THE LINE BY
LOOKING AT VARIANCE. IF THE
VARIATION AROUND THE LINE
IS MUCH SMALLER THAN THE
TOTAL VARIANCE OF Y, THAT'S
EVIDENCE THAT THE LINE
ACCOUNTS FOR Y'S VARIATION,
AND THE STATISTIC WILL BE
LARGE. IF THE RATIO IS SMALL,
THE LINE DOESN'T ACCOUNT
FOR MUCH VARIATION IN Y, AND
PROBABLY ISN'T USEFUL!

SO A ≠ 0.
WHAT A RELIEF!

STEP 5: CALCULATE THE CONFIDENCE INTERVALS.

NOW, LET'S TAKE
A CLOSER LOOK
AT HOW WELL
OUR REGRESSION
EQUATION
REPRESENTS THE
POPULATION.

IN THE
POPULATION...

...LOTS OF
DAYS HAVE A
HIGH OF 31°C,
AND THE NUMBER
OF ICED TEA
ORDERS ON
THOSE DAYS
VARIES. OUR
REGRESSION
EQUATION
PREDICTS ONLY
ONE VALUE
FOR ICED TEA
ORDERS AT THAT
TEMPERATURE.

HOW DO WE
KNOW THAT
IT'S THE RIGHT
VALUE?

WE CAN'T KNOW
FOR SURE. WE
CHOOSE THE MOST
LIKELY VALUE: THE
POPULATION MEAN.

IF THE POPULATION HAS A
NORMAL DISTRIBUTION...

DAYS WITH A HIGH
OF 31°C CAN EXPECT
APPROXIMATELY THE
MEAN NUMBER OF ICED
TEA ORDERS. WE CAN'T
KNOW THE EXACT MEAN,
BUT WE CAN ESTIMATE
A RANGE IN WHICH IT
MIGHT FALL.

[Figure: the regression equation drawn with a band around it, bounded by the maximum mean orders above and the minimum mean orders below.]

THE MEAN NUMBER
OF ORDERS IS
SOMEWHERE IN HERE.

HUH? THE RANGES
DIFFER, DEPENDING
ON THE VALUE OF x!


WE CALCULATE AN 
INTERVAL FOR EACH 
TEMPERATURE. 


AS YOU NOTICED, THE
WIDTH VARIES. IT'S
SMALLER NEAR x̄, WHICH
IS THE AVERAGE HIGH
TEMPERATURE VALUE.

EVEN THIS INTERVAL
ISN'T ABSOLUTELY
GUARANTEED TO
CONTAIN THE TRUE
POPULATION MEAN.
OUR CONFIDENCE
IS DETERMINED BY
THE CONFIDENCE
COEFFICIENT.

SOUNDS
FAMILIAR!

IT'S NO ORDINARY
COEFFICIENT.

THERE IS NO
EQUATION TO
CALCULATE IT,
NO SET RULE.

YOU CHOOSE
THE CONFIDENCE
COEFFICIENT, AND
YOU CAN MAKE IT
ANY PERCENTAGE
YOU WANT.

WHEN CALCULATING A CONFIDENCE
INTERVAL, YOU CHOOSE THE
CONFIDENCE COEFFICIENT FIRST.

YOU WOULD THEN SAY "A 42%
CONFIDENCE INTERVAL FOR ICED TEA
ORDERS WHEN THE TEMPERATURE IS 31°C
IS 30 TO 35 ORDERS," FOR EXAMPLE!

I CHOOSE?

WELL, MOST
PEOPLE
CHOOSE EITHER
95% OR 99%.

HEY, IF OUR
CONFIDENCE IS BASED
ON THE COEFFICIENT,
ISN'T HIGHER BETTER?

WELL, NOT
NECESSARILY.

TRUE, OUR CONFIDENCE
IS HIGHER WHEN WE
CHOOSE 99%, BUT THE
INTERVAL WILL BE MUCH
LARGER, TOO.

99%: THE NUMBER OF ORDERS
OF ICED TEA IS ALMOST
CERTAINLY BETWEEN
0 AND 120!

THAT'S NOT
SURPRISING.

95%: THE NUMBER OF ORDERS
OF ICED TEA IS PROBABLY
BETWEEN 40 AND 80!

SWEET!

HOWEVER, IF THE
CONFIDENCE COEFFICIENT
IS TOO LOW, THE RESULT IS
NOT CONVINCING.

HMM, I SEE.

NOW, SHALL WE
CALCULATE THE
CONFIDENCE INTERVAL
FOR THE POPULATION
OF DAYS WITH A HIGH
TEMPERATURE OF 31°C?

YES,
LET'S!

HERE'S HOW TO CALCULATE A 95%
CONFIDENCE INTERVAL FOR ICED TEA
ORDERS ON DAYS WITH A HIGH OF 31°C.

This is the confidence interval (number of orders of iced tea):

Lower limit: 79.5* - 3.9 = 75.6
Estimated mean: 31 × a + b = 31 × 3.7 - 36.4 = 79.5
Upper limit: 79.5 + 3.9 = 83.4

Distance from the estimated mean is

$$\sqrt{F(1, n-2; .05) \times \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \times \frac{S_e}{n-2}} = \sqrt{F(1, 14-2; .05) \times \left(\frac{1}{14} + \frac{(31 - 29.1)^2}{129.7}\right) \times \frac{391.1}{14-2}} = 3.9$$

where n is the number of data points in our sample and F is a ratio of
two chi-squared distributions, as described on page 57.

TO CALCULATE A 99% CONFIDENCE INTERVAL,
JUST CHANGE

F(1,14 - 2;.05) = 4.7
TO
F(1,14 - 2;.01) = 9.3

(REFER TO PAGE 58 FOR AN EXPLANATION OF F(1,n-2;.05) = 4.7, AND SO ON.)

* THE VALUE 79.5 WAS CALCULATED USING UNROUNDED NUMBERS.
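The interval above can be checked numerically. The following is a small sketch, assuming the Norns data from this chapter and the critical value F(1, 12; .05) = 4.7472 quoted in the ANOVA steps; the function name `confidence_interval` is invented for this example.

```python
from math import sqrt

# Norns data from this chapter.
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

F_CRITICAL_05 = 4.7472  # F(1, 12; .05), quoted from the text


def confidence_interval(x0, xs, ys, f_critical):
    """Interval for the population mean of y at x = x0 (simple regression)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    s_e = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

    estimate = a * x0 + b
    # Distance from the estimated mean, following the formula in the text.
    distance = sqrt(f_critical * (1 / n + (x0 - x_bar) ** 2 / s_xx) * s_e / (n - 2))
    return estimate - distance, estimate, estimate + distance


low, estimate, high = confidence_interval(31, temps, orders, F_CRITICAL_05)
print(f"{low:.1f} <= mean orders at 31°C <= {high:.1f} (estimate {estimate:.1f})")
```

Swapping in the .01 critical value widens the interval, and moving x0 away from the average temperature does the same through the (x0 - x̄)² term.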
STEP 6: MAKE A PREDICTION!

* THIS CALCULATION WAS PERFORMED USING ROUNDED FIGURES.
IF YOU'RE DOING THE CALCULATION WITH THE FULL,
UNROUNDED FIGURES, YOU SHOULD GET 64.6.


THAT'S A GREAT
QUESTION.

WE SHOULD
GET CLOSE TO
64 ORDERS BECAUSE
THE VALUE OF R² IS
0.8225, BUT...
HOW CLOSE?

WE'LL MAKE A
PREDICTION INTERVAL!

WE'LL PICK A
COEFFICIENT AND
THEN CALCULATE A
RANGE IN WHICH ICED
TEA ORDERS WILL
MOST LIKELY FALL.

DIDN'T WE
JUST DO
THAT?

NOT QUITE. BEFORE, WE
WERE PREDICTING THE
MEAN NUMBER OF ICED
TEA ORDERS FOR THE
POPULATION OF DAYS
WITH A CERTAIN HIGH
TEMPERATURE, BUT NOW
WE'RE PREDICTING THE
LIKELY NUMBER OF ICED
TEA ORDERS ON A GIVEN
DAY WITH A CERTAIN
TEMPERATURE.

WHAT'S THE
POPULATION?

CONFIDENCE INTERVALS
HELP US ASSESS THE
POPULATION.

HOW MANY
ORDERS WILL
I GET?

PREDICTION INTERVALS
GIVE A PROBABLE RANGE
OF FUTURE VALUES.

AS WITH A
CONFIDENCE
INTERVAL, WE
NEED TO CHOOSE
THE CONFIDENCE
COEFFICIENT
BEFORE WE CAN DO
THE CALCULATION.
AGAIN, 95% AND 99%
ARE POPULAR.

I JUST CALCULATED
AN INTERVAL, SO
THIS SHOULD BE
A SNAP.

THE CALCULATION IS
VERY SIMILAR, WITH
ONE IMPORTANT
DIFFERENCE...

THE PREDICTION
INTERVAL WILL BE
WIDER BECAUSE
IT COVERS THE
RANGE OF
ALL EXPECTED
VALUES, NOT JUST
WHERE THE MEAN
SHOULD BE.

[Figure: the prediction interval drawn around the regression line is wider than the confidence interval.]

THE FUTURE
IS ALWAYS
SURPRISING.

NOW, TRY
CALCULATING
THE PREDICTION
INTERVAL
FOR 27°C.









HERE'S HOW WE CALCULATE A 95% PREDICTION
INTERVAL FOR TOMORROW'S ICED TEA SALES.

This is the prediction interval (number of orders of iced tea):

Lower limit: 64.6 - 13.1 = 51.5
Estimated value: 27 × a + b = 27 × 3.7 - 36.4 = 64.6
Upper limit: 64.6 + 13.1 = 77.7*

Distance from the estimated value is

$$\sqrt{F(1, n-2; .05) \times \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) \times \frac{S_e}{n-2}} = \sqrt{F(1, 14-2; .05) \times \left(1 + \frac{1}{14} + \frac{(27 - 29.1)^2}{129.7}\right) \times \frac{391.1}{14-2}} = 13.1$$

THE ESTIMATED NUMBER OF TEA ORDERS WE
CALCULATED EARLIER (ON PAGE 95) WAS ROUNDED,
BUT WE'VE USED THE NUMBER OF TEA ORDERS
ESTIMATED USING UNROUNDED NUMBERS, 64.6, HERE.

HERE WE USED THE F DISTRIBUTION TO FIND THE
PREDICTION INTERVAL AND POPULATION REGRESSION.
TYPICALLY, STATISTICIANS USE THE T DISTRIBUTION
TO GET THE SAME RESULTS.

* THIS CALCULATION WAS PERFORMED USING THE ROUNDED NUMBERS
SHOWN HERE. THE FULL, UNROUNDED CALCULATION RESULTS IN 77.6.

SO WE'RE 95% CONFIDENT
THAT THE NUMBER OF
ICED TEA ORDERS WILL BE
BETWEEN 52 AND 78 WHEN
THE HIGH TEMPERATURE
FOR THAT DAY IS 27°C.
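The prediction interval differs from the confidence interval only by the extra "1 +" inside the parentheses. The sketch below shows that one change, again assuming the Norns data and the critical value 4.7472 from the text; `prediction_interval` is a name made up for this illustration.

```python
from math import sqrt

# Norns data from this chapter.
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

F_CRITICAL_05 = 4.7472  # F(1, 12; .05), quoted from the text


def prediction_interval(x0, xs, ys, f_critical):
    """Interval for a single future y at x = x0 (simple regression)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    s_e = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

    estimate = a * x0 + b
    # The leading 1 accounts for the scatter of an individual day around the mean.
    distance = sqrt(
        f_critical * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx) * s_e / (n - 2)
    )
    return estimate - distance, estimate, estimate + distance


low, estimate, high = prediction_interval(27, temps, orders, F_CRITICAL_05)
print(f"{low:.1f} <= orders at 27°C <= {high:.1f} (estimate {estimate:.1f})")
```

With unrounded inputs the limits come out near 51.5 and 77.6, in line with the footnote about the unrounded calculation.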
WHICH STEPS ARE NECESSARY?

Remember the regression analysis procedure introduced on 
page 68? 

1. Draw a scatter plot of the independent variable versus the 
dependent variable. If the dots line up, the variables may be 
correlated. 

2. Calculate the regression equation. 

3. Calculate the correlation coefficient (R) and assess our
population and assumptions.

4. Conduct the analysis of variance. 

5. Calculate the confidence intervals. 

6. Make a prediction! 

In this chapter, we walked through each of the six steps, but it 
isn’t always necessary to do every step. Recall the example of Miu’s 
age and height on page 25. 

• Fact: There is only one Miu in this world.

• Fact: Miu’s height when she was 10 years old was 137.5 cm.

Given these two facts, it makes no sense to say that “Miu’s
height when she was 10 years old follows a normal distribution
with mean Ax + B and standard deviation σ.” In other words, it’s
nonsense to analyze the population of Miu’s heights at 10 years old.
She was just one height, and we know what her height was.

In regression analysis, we either analyze the entire population
or, much more commonly, analyze a sample of the larger population.
When you analyze a sample, you should perform all the steps.
However, since Steps 4 and 5 assess how well the sample represents
the population, you can skip them if you’re using data from an entire
population instead of just a sample.

NOTE We use the term statistic to describe a measurement of a
characteristic from a sample, like a sample mean, and parameter to
describe a measurement that comes from a population, like a
population mean or coefficient.


STANDARDIZED RESIDUAL

Remember that a residual is the difference between the measured
value and the value estimated with the regression equation. The
standardized residual is the residual divided by its estimated
standard deviation. We use the standardized residual to assess
whether a particular measurement deviates significantly from
the trend. For example, say a group of thirsty joggers stopped
by Norns on the 4th, meaning that though iced tea orders were
expected to be about 76 based on that day’s high temperature,
customers actually placed 84 orders for iced tea. Such an event would
result in a large standardized residual.

Standardized residuals are calculated by dividing each residual
by an estimate of its standard deviation, which is calculated using
the residual sum of squares. The calculation is a little complicated,
and most statistics software does it automatically, so we won’t go
into the details of the calculation here.

Table 2-1 shows the standardized residual for the Norns data 
used in this chapter. 

TABLE 2-1: CALCULATING THE STANDARDIZED RESIDUAL

| Date | High temperature x | Measured number of orders of iced tea y | Estimated number of orders of iced tea ŷ = 3.7x - 36.4 | Residual y - ŷ | Standardized residual |
|---|---|---|---|---|---|
| 22nd (Mon.) | 29 | 77 | 72.0 | 5.0 | 0.9 |
| 23rd (Tues.) | 28 | 62 | 68.3 | -6.3 | -1.2 |
| 24th (Wed.) | 34 | 93 | 90.7 | 2.3 | 0.5 |
| 25th (Thurs.) | 31 | 84 | 79.5 | 4.5 | 0.8 |
| 26th (Fri.) | 25 | 59 | 57.1 | 1.9 | 0.4 |
| 27th (Sat.) | 29 | 64 | 72.0 | -8.0 | -1.5 |
| 28th (Sun.) | 32 | 80 | 83.3 | -3.3 | -0.6 |
| 29th (Mon.) | 31 | 75 | 79.5 | -4.5 | -0.8 |
| 30th (Tues.) | 24 | 58 | 53.3 | 4.7 | 1.0 |
| 31st (Wed.) | 33 | 91 | 87.0 | 4.0 | 0.8 |
| 1st (Thurs.) | 25 | 51 | 57.1 | -6.1 | -1.2 |
| 2nd (Fri.) | 31 | 73 | 79.5 | -6.5 | -1.2 |
| 3rd (Sat.) | 26 | 65 | 60.8 | 4.2 | 0.8 |
| 4th (Sun.) | 30 | 84 | 75.8 | 8.2 | 1.5 |


As you can see, the standardized residual on the 4th is 1.5. If 
iced tea orders had been 76, as expected, the standardized residual 
would have been 0. 

Sometimes a measured value can deviate so much from the
trend that it adversely affects the analysis. If the standardized
residual is greater than 3 or less than -3, the measurement is
considered an outlier. There are a number of ways to handle
outliers, including removing them, changing them to a set value,
or just keeping them in the analysis as is. To determine which
approach is most appropriate, investigate the underlying cause
of the outliers.
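The text leaves the exact formula to software, so the sketch below rests on an assumption: it uses one common definition, in which each residual is divided by the square root of Se / (n - 2) times (1 - leverage), with leverage 1/n + (x - x̄)² / Sxx. That definition is not stated in this chapter, but for the Norns data it gives values that round to those in Table 2-1. The function name `leverage` is invented for this example.

```python
from math import sqrt

# Norns data from this chapter.
temps = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
orders = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(orders) / n
s_xx = sum((x - x_bar) ** 2 for x in temps)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, orders))
a = s_xy / s_xx
b = y_bar - a * x_bar

residuals = [y - (a * x + b) for x, y in zip(temps, orders)]
s_e = sum(e ** 2 for e in residuals)
residual_variance = s_e / (n - 2)


def leverage(x):
    """Assumed leverage term for simple regression: 1/n + (x - x_bar)^2 / Sxx."""
    return 1 / n + (x - x_bar) ** 2 / s_xx


# Assumed definition: residual divided by its estimated standard deviation.
standardized = [
    e / sqrt(residual_variance * (1 - leverage(x)))
    for x, e in zip(temps, residuals)
]

# Flag measurements beyond the +/-3 rule of thumb given in the text.
outliers = [i for i, z in enumerate(standardized) if abs(z) > 3]

for x, e, z in zip(temps, residuals, standardized):
    print(f"x = {x}: residual {e:6.1f}, standardized {z:5.1f}")
print("outlier rows:", outliers)
```

None of the fourteen days crosses the ±3 threshold, so under this rule of thumb the Norns data contain no outliers.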
INTERPOLATION AND EXTRAPOLATION


If you look at the x values (high temperature) on page 64, you can
see that the highest value is 34°C and the lowest value is 24°C.
Using regression analysis, you can interpolate the number of iced
tea orders on days with a high temperature between 24°C and 34°C
and extrapolate the number of iced tea orders on days with a high
below 24°C or above 34°C. In other words, extrapolation is the
estimation of values that fall outside the range of your observed data.

Since we’ve only observed the trend between 24°C and 34°C, we
don’t know whether iced tea sales follow the same trend when the
weather is extremely cold or extremely hot. Extrapolation is
therefore less reliable than interpolation, and some statisticians avoid it
entirely.

For everyday use, it’s fine to extrapolate, as long as you’re
aware that your result isn’t completely trustworthy. However, avoid
using extrapolation in academic research or to estimate a value
that’s far beyond the scope of the measured data.
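One practical way to keep this distinction in view is to have a prediction helper report which kind of estimate it is returning. The sketch below is only an illustration: it uses the rounded equation y = 3.7x - 36.4 and the observed range of 24°C to 34°C from this chapter, and the function name `predict_orders` is hypothetical.

```python
# Rounded sample regression equation and observed temperature range from this chapter.
SLOPE = 3.7
INTERCEPT = -36.4
OBSERVED_MIN = 24  # lowest observed high temperature (°C)
OBSERVED_MAX = 34  # highest observed high temperature (°C)


def predict_orders(high_temp):
    """Return the estimated orders and whether the estimate is interpolated or extrapolated."""
    estimate = SLOPE * high_temp + INTERCEPT
    if OBSERVED_MIN <= high_temp <= OBSERVED_MAX:
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return estimate, kind


for temp in (27, 31, 40):
    estimate, kind = predict_orders(temp)
    print(f"{temp}°C: about {estimate:.1f} orders ({kind})")
```

The arithmetic is identical in both cases; the label simply reminds the reader that an estimate at 40°C leans on a trend that was never observed at that temperature.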


AUTOCORRELATION 

The independent variable used in this chapter was high temperature;
this is used to predict iced tea sales. In most places, it’s
unlikely that the high temperature will be 20°C one day and then
shoot up to 30°C the next day. Normally, the temperature rises or
drops gradually over a period of several days, so if the two variables
are related, the number of iced tea orders should rise or drop
gradually as well. Our assumption, however, has been that the deviation
(error) values are random. Therefore, our predicted values do not
change from day to day as smoothly as they might in real life.

When analyzing variables that may be affected by the passage of
time, it’s a good idea to check for autocorrelation. Autocorrelation
occurs when the error is correlated over time, and it can indicate
that you need to use a different type of regression model.

There’s an index to describe autocorrelation, the Durbin-Watson
statistic, which is calculated as follows:

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$

where e_t is the residual for the t-th observation and T is the number of observations.
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The equation can be read as “the sum of the square of 
each residual minus the previous residual, divided by the sum 
of each residual squared.” You can calculate the value of the 
Durbin-Watson statistic for the example in this chapter: 

$$d = \frac{(-6.3 - 5.0)^2 + (2.3 - (-6.3))^2 + \cdots + (8.2 - 4.2)^2}{5.0^2 + (-6.3)^2 + \cdots + 8.2^2}$$


The exact critical value of the Durbin-Watson test differs for 
each analysis, and you can use a table to find it, but generally we 
use 1 as a cutoff: a result less than 1 may indicate the presence of 
autocorrelation. This result is close to 2, so we can conclude that 
there is no autocorrelation in our example. 
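The verbal description of the statistic translates directly into code. This sketch assumes the rounded residuals listed in Table 2-1, taken in date order; the function name `durbin_watson` is chosen for this example and is not tied to any particular library.

```python
# Rounded residuals from Table 2-1, in date order (22nd through 4th).
residuals = [5.0, -6.3, 2.3, 4.5, 1.9, -8.0, -3.3, -4.5, 4.7, 4.0, -6.1, -6.5, 4.2, 8.2]


def durbin_watson(errors):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (current - previous) ** 2
        for previous, current in zip(errors, errors[1:])
    )
    denominator = sum(e ** 2 for e in errors)
    return numerator / denominator


d = durbin_watson(residuals)

# Rule of thumb from the text: a result below 1 may indicate autocorrelation.
possible_autocorrelation = d < 1

print(f"d = {d:.2f}, possible autocorrelation: {possible_autocorrelation}")
```

Because the inputs here are rounded to one decimal place, the result may differ slightly from a calculation on unrounded residuals, but it stays well above the cutoff of 1.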


NONLINEAR REGRESSION

On page 66, Risa said:

THE GOAL OF REGRESSION ANALYSIS
IS TO OBTAIN THE REGRESSION
EQUATION IN THE FORM OF
y = ax + b.

This equation is linear, but regression equations don’t have
to be linear. For example, these equations may also be used as
regression equations:

• y = a/x + b
• y = a√x + b
• y = ax² + bx + c
• y = a × log x + b

The regression equation for Miu’s age and height introduced on
page 26 is actually in the form of y = a/x + b rather than y = ax + b.
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Of course, this raises the question of which type of equation 
you should choose when performing regression analysis on your 
own data. Below are some steps that can help you decide. 

1. Draw a scatter plot of the data points, with the independent
variable values on the x-axis and the dependent variable values on
the y-axis. Examine the relationship between the variables
suggested by the spread of the dots: Are they in roughly a straight
line? Do they fall along a curve? If the latter, what is the shape
of the curve?

2. Try the regression equation suggested by the shape in the
variables plotted in Step 1. Plot the residuals (or standardized
residuals) on the y-axis and the independent variable on the
x-axis. The residuals should appear to be random, so if there
is an obvious pattern in the residuals, like a curved shape, this
suggests that the regression equation doesn’t match the shape
of the relationship.

3. If the residuals plot from Step 2 shows a pattern in the residuals,
try a different regression equation and repeat Step 2. Try the
shapes of several regression equations and pick one that appears
to most closely match the data. It’s usually best to pick the
simplest equation that fits the data well.


TRANSFORMING NONLINEAR EQUATIONS INTO LINEAR EQUATIONS 

There’s another way to deal with nonlinear equations: simply turn
them into linear equations. For an example, look at the equation for
Miu’s age and height (from page 26):

$$y = -\frac{326.6}{x} + 173.3$$

You can turn this into a linear equation. Remember:

If $\frac{1}{x} = X$, then $\frac{1}{X} = x$.

So we’ll define a new variable X, set it equal to 1/x, and use X
in the normal y = aX + b regression equation. As shown on page 76,
the value of a and b in the regression equation y = aX + b can be
calculated as follows:

$$a = \frac{S_{Xy}}{S_{XX}}, \qquad b = \bar{y} - \bar{X}a$$



We continue with the analysis as usual. See Table 2-2. 


TABLE 2-2: CALCULATING THE REGRESSION EQUATION

| Age x | 1/age X = 1/x | Height y | X - X̄ | y - ȳ | (X - X̄)² | (y - ȳ)² | (X - X̄)(y - ȳ) |
|---|---|---|---|---|---|---|---|
| 4 | 0.2500 | 100.1 | 0.1428 | -38.1625 | 0.0204 | 1456.3764 | -5.4515 |
| 5 | 0.2000 | 107.2 | 0.0928 | -31.0625 | 0.0086 | 964.8789 | -2.8841 |
| 6 | 0.1667 | 114.1 | 0.0595 | -24.1625 | 0.0035 | 583.8264 | -1.4381 |
| 7 | 0.1429 | 121.7 | 0.0357 | -16.5625 | 0.0013 | 274.3164 | -0.5914 |
| 8 | 0.1250 | 126.8 | 0.0178 | -11.4625 | 0.0003 | 131.3889 | -0.2046 |
| 9 | 0.1111 | 130.9 | 0.0040 | -7.3625 | 0.0000 | 54.2064 | -0.0292 |
| 10 | 0.1000 | 137.5 | -0.0072 | -0.7625 | 0.0001 | 0.5814 | 0.0055 |
| 11 | 0.0909 | 143.2 | -0.0162 | 4.9375 | 0.0003 | 24.3789 | -0.0802 |
| 12 | 0.0833 | 149.4 | -0.0238 | 11.1375 | 0.0006 | 124.0439 | -0.2653 |
| 13 | 0.0769 | 151.6 | -0.0302 | 13.3375 | 0.0009 | 177.8889 | -0.4032 |
| 14 | 0.0714 | 154.0 | -0.0357 | 15.7375 | 0.0013 | 247.6689 | -0.5622 |
| 15 | 0.0667 | 154.6 | -0.0405 | 16.3375 | 0.0016 | 266.9139 | -0.6614 |
| 16 | 0.0625 | 155.0 | -0.0447 | 16.7375 | 0.0020 | 280.1439 | -0.7473 |
| 17 | 0.0588 | 155.1 | -0.0483 | 16.8375 | 0.0023 | 283.5014 | -0.8137 |
| 18 | 0.0556 | 155.3 | -0.0516 | 17.0375 | 0.0027 | 290.2764 | -0.8790 |
| 19 | 0.0526 | 155.7 | -0.0545 | 17.4375 | 0.0030 | 304.0664 | -0.9507 |
| Sum | 184 | 1.7144 | 2212.2 | 0.0000 | 0.0000 | 0.0489 | 5464.4575 | -15.9563 |
| Average | 11.5 | 0.1072 | 138.3 | | | | | |


According to the table:

$$a = \frac{S_{Xy}}{S_{XX}} = \frac{-15.9563}{0.0489} = -326.6^{*}$$

$$b = \bar{y} - \bar{X}a = 138.2625 - 0.1072 \times (-326.6) = 173.3$$

So the regression equation is this:

$$y = -326.6X + 173.3$$

where y is height and X is 1/age.

* If your result is slightly different from -326.6, the difference might be due to
rounding. If so, it should be very small.

which is the same as this:

$$y = -\frac{326.6}{x} + 173.3$$

where y is height and x is age.

We’ve transformed our original, nonlinear equation into a 
linear one! 
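The same transformation takes only a few lines of code. The sketch below assumes the ages and heights listed in Table 2-2, replaces each age x with X = 1/x, and then applies the ordinary simple regression formulas to X and y; the variable names are chosen for this example.

```python
# Miu's age (years) and height (cm) from Table 2-2.
ages = list(range(4, 20))
heights = [100.1, 107.2, 114.1, 121.7, 126.8, 130.9, 137.5, 143.2,
           149.4, 151.6, 154.0, 154.6, 155.0, 155.1, 155.3, 155.7]

# Transform the predictor: X = 1 / age.
inverse_ages = [1 / age for age in ages]

n = len(ages)
x_bar = sum(inverse_ages) / n
y_bar = sum(heights) / n

s_xx = sum((x - x_bar) ** 2 for x in inverse_ages)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(inverse_ages, heights))

# Ordinary simple regression on the transformed variable: y = a*X + b.
a = s_xy / s_xx
b = y_bar - x_bar * a


def estimated_height(age):
    """Estimated height in the original, nonlinear form y = a / age + b."""
    return a / age + b


print(f"a = {a:.1f}, b = {b:.1f}")
print(f"estimated height at age 10: {estimated_height(10):.1f} cm")
```

Working with unrounded values of X avoids the small rounding differences mentioned in the footnote to the table calculation.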
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WE'RE GOING TO 
COVER MULTIPLE
REGRESSION ANALYSIS
TODAY, AND KAZAMI
BROUGHT US SOME
DATA TO ANALYZE.

THANKS FOR
HELPING ME.

NO PROBLEM.
THIS IS GOING TO
HELP ME, TOO.

YOU LIKE
CROISSANTS,
DON'T YOU,
MIU?

OF
COURSE!
THEY'RE
DELICIOUS.

WHICH BAKERY
IS YOUR
FAVORITE?

DEFINITELY KAZAMI
BAKERY. THEIRS
ARE THE BEST!


IT'S JUST A
SMALL FAMILY
BUSINESS.

THERE ARE ONLY
TEN RIGHT NOW,
AND MOST OF
THEM ARE HERE
IN THE CITY.

[Figure: map of shop locations, with labels including WAKABA RIVERSIDE and ISEBASHI.]

WE'RE PLANNING
TO OPEN A NEW
ONE SOON.

SO TODAY...

WE'RE GOING TO
PREDICT THE SALES
OF THE NEW SHOP
USING MULTIPLE
REGRESSION
ANALYSIS.
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ACCORDING TO MY NOTES, 

MULTIPLE REGRESSION
ANALYSIS USES MORE
THAN ONE FACTOR TO
PREDICT AN OUTCOME.

IN SIMPLE REGRESSION
ANALYSIS, WE USED ONE
VARIABLE TO PREDICT
THE VALUE OF ANOTHER
VARIABLE.

IN MULTIPLE
REGRESSION ANALYSIS,
WE USE MORE THAN ONE
VARIABLE TO PREDICT
THE VALUE OF OUR
OUTCOME VARIABLE.

MULTIPLE REGRESSION EQUATION

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p + b$$

where y is the outcome variable, x_1 through x_p are the predictor variables, and a_1 through a_p are the partial regression coefficients.

AND JUST ONE OUTCOME
VARIABLE, y. LIKE THIS, SEE?

I GET IT!

[Figure: regression analysis (one predictor) compared with multiple regression analysis (several predictors).]

















THE MULTIPLE REGRESSION EQUATION 


ARE THE STEPS
THE SAME AS IN
SIMPLE REGRESSION
ANALYSIS?

MULTIPLE REGRESSION ANALYSIS PROCEDURE

STEP 1 DRAW A SCATTER PLOT OF EACH PREDICTOR VARIABLE AND THE
OUTCOME VARIABLE TO SEE IF THEY APPEAR TO BE RELATED.

STEP 2 CALCULATE THE MULTIPLE REGRESSION EQUATION.

STEP 3 EXAMINE THE ACCURACY OF THE MULTIPLE REGRESSION EQUATION.

STEP 4 CONDUCT THE ANALYSIS OF VARIANCE (ANOVA) TEST.

STEP 5 CALCULATE CONFIDENCE INTERVALS FOR THE POPULATION.

STEP 6 MAKE A PREDICTION!

WE HAVE TO LOOK AT
EACH PREDICTOR ALONE
AND ALL OF THEM
TOGETHER.
STEP 1: DRAW A SCATTER PLOT OF EACH PREDICTOR VARIABLE AND THE
OUTCOME VARIABLE TO SEE IF THEY APPEAR TO BE RELATED.



| Bakery | Floor space of the shop (tsubo*) | Distance to the nearest station (meters) | Monthly sales (¥10,000) |
|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 469 |
| Terai Station Shop | 8 | 0 | 366 |
| Sone Shop | 8 | 200 | 371 |
| Hashimoto Station Shop | 5 | 200 | 208 |
| Kikyou Town Shop | 7 | 300 | 246 |
| Post Office Shop | 8 | 230 | 297 |
| Suidobashi Station Shop | 7 | 40 | 363 |
| Rokujo Station Shop | 9 | 0 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 198 |
| Misato Shop | 9 | 180 | 364 |

* 1 tsubo is about 36 square feet.
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MONTHLY SALES 


[Figure: two scatter plots of monthly sales for the ten shops. Left panel: monthly sales against floor space, correlation coefficient = .8924. Right panel: monthly sales against distance to the nearest station, correlation coefficient = -.7751.]

MONTHLY SALES
ARE HIGHER IN
BIGGER SHOPS AND
IN SHOPS NEAR A
TRAIN STATION.

LOOKS LIKE
BOTH OF
THESE AFFECT
MONTHLY
SALES.

TOTALLY.
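The two correlation coefficients quoted with the scatter plots can be recomputed from the shop table. This is a minimal sketch assuming the ten shops' floor space, distance, and monthly sales as listed in Step 1; the helper name `correlation` is made up for this example.

```python
from math import sqrt

# Shop data from the Step 1 table.
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]               # tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]      # meters to nearest station
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]   # monthly sales, ¥10,000


def correlation(u, v):
    """Pearson correlation: sum of products over the root of the two sums of squares."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((p - u_bar) * (q - v_bar) for p, q in zip(u, v))
    s_uu = sum((p - u_bar) ** 2 for p in u)
    s_vv = sum((q - v_bar) ** 2 for q in v)
    return s_uv / sqrt(s_uu * s_vv)


r_floor = correlation(floor_space, sales)
r_distance = correlation(distance, sales)

print(f"floor space vs. sales: {r_floor:.4f}")
print(f"distance vs. sales:    {r_distance:.4f}")
```

The positive sign for floor space and the negative sign for distance match the dialogue: bigger shops and shops nearer a station tend to sell more.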
STEP 2: CALCULATE THE MULTIPLE REGRESSION EQUATION.


A LOT OF THE
COEFFICIENTS WE
CALCULATED DURING
SIMPLE REGRESSION
ANALYSIS ARE ALSO
INVOLVED IN MULTIPLE
REGRESSION.

BUT THE
CALCULATION IS A BIT
MORE COMPLICATED.
DO YOU REMEMBER
THE METHOD
WE USED?

THAT'S RIGHT!
LET'S REVIEW.

FIRST GET THE RESIDUAL
SUM OF SQUARES, Se.

$$S_e = \{469 - (a_1 \times 10 + a_2 \times 80 + b)\}^2 + \{366 - (a_1 \times 8 + a_2 \times 0 + b)\}^2 + \cdots + \{364 - (a_1 \times 9 + a_2 \times 180 + b)\}^2$$

AND PRESTO!
PIECE OF CAKE.
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FLOOR SPACE 


[Figure: the shop data arranged as matrices. The predictor matrix has a row of 1s, the floor space values (10, 8, 8, 5, 7, 8, 7, 9, 6, 9), and the distance to the nearest station values (80, 0, 200, 200, 300, 230, 40, 0, 330, 180); the monthly sales form a column vector.]

THE 1S ARE A BASELINE
MULTIPLIER USED
TO CALCULATE THE
INTERCEPT (b).
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SO HERE'S 
THE EQUATION. 


$$y = 41.5x_1 - 0.3x_2 + 65.3$$

where y is monthly sales, x_1 is floor space, and x_2 is distance to the nearest station.
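For readers who want to reproduce these coefficients, the sketch below fits the two-predictor equation by least squares. It is an illustration under stated assumptions, not the book's matrix procedure: it solves the equivalent pair of normal equations in deviation form with Cramer's rule, using the shop data from Step 1. The helper names `mean` and `s` are invented here.

```python
# Shop data from the Step 1 table.
floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]               # x1, tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]      # x2, meters
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]   # y, ¥10,000


def mean(values):
    return sum(values) / len(values)


def s(u, v):
    """Sum of products of deviations from the means (a sum of squares when u is v)."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((p - u_bar) * (q - v_bar) for p, q in zip(u, v))


s11 = s(floor_space, floor_space)
s22 = s(distance, distance)
s12 = s(floor_space, distance)
s1y = s(floor_space, sales)
s2y = s(distance, sales)

# Minimizing Se leads to the normal equations (deviation form):
#   s11 * a1 + s12 * a2 = s1y
#   s12 * a1 + s22 * a2 = s2y
determinant = s11 * s22 - s12 ** 2
a1 = (s1y * s22 - s12 * s2y) / determinant
a2 = (s11 * s2y - s12 * s1y) / determinant

# The intercept makes the fitted plane pass through the point of averages.
b = mean(sales) - a1 * mean(floor_space) - a2 * mean(distance)

print(f"y = {a1:.1f} * x1 + ({a2:.2f}) * x2 + {b:.1f}")
```

Cramer's rule is convenient only because there are two predictors; with more predictors a general linear solver would be the natural choice.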


YOU SHOULD
WRITE THIS
DOWN.

THERE'S
ONE MORE
THING...

...THAT WILL HELP
YOU UNDERSTAND THE
MULTIPLE REGRESSION
EQUATION AND MULTIPLE
REGRESSION ANALYSIS.

WHAT
IS IT?
THE LINE PLOTTED BY THE
MULTIPLE REGRESSION EQUATION
y = a1x1 + a2x2 + ... + apxp + b
WILL ALWAYS CROSS THE POINTS
(x̄1, x̄2, ..., x̄p, ȳ), WHERE
x̄i IS THE AVERAGE OF xi.

THIS SEEMS
FAMILIAR...

MY BRAIN IS
MELTING.

THINK, THINK.
WHERE HAVE
I SEEN
THIS?

TO PUT IT DIFFERENTLY, OUR EQUATION y = 41.5x1 - 0.3x2 + 65.3 WILL
ALWAYS CREATE A LINE THAT INTERSECTS THE POINTS WHERE AVERAGE
FLOOR SPACE AND AVERAGE DISTANCE TO THE NEAREST STATION
INTERSECT WITH THE AVERAGE SALES OF THE DATA THAT WE USED.

OH YEAH! WHEN
WE PLOT OUR
EQUATION, THE LINE
PASSES THROUGH
THE AVERAGES.


STEP 3: EXAMINE THE ACCURACY OF THE MULTIPLE REGRESSION EQUATION. 


SO NOW WE HAVE
AN EQUATION,
BUT HOW WELL
CAN WE REALLY
PREDICT THE
SALES OF THE
NEW SHOP?

WE'LL FIND OUT
USING REGRESSION
DIAGNOSTICS. WE'LL
NEED TO FIND R², AND
IF IT'S CLOSE TO 1,
THEN OUR EQUATION IS
PRETTY ACCURATE!
BEFORE WE FIND R², WE NEED TO FIND PLAIN OLD

R, WHICH IN THI5 CA5E 15 CALLEE? THE MULTIPLE 
COPPELATION COEFFICIENT. PEMEM3EP: R 15 A WAY 
OF COMPAPINO THE ACTUAL MEA5UPEE? VALUE5 Cy? 
WITH OUP E5TIMATEE? VALUE5 Cyl* * 




( 

WE PON'T NEEP S 
YET, BUT WE WILL 
U5E IT UATEK. 

sum of (y-y)(y-y) 

S yy 

Jsum of (y - y) 2 x sum of (y - y) 

X ^yy 

72026.6 



V76199.6 x 72026.6 


K 2 = (.9722) 2 = .9452 


* AS IN CHAPTFP 2/ SOME OF THF FIOUPFS IN THIS CHAPTFP APS POUNPFP FOP 
PFAPA3IUTY, BUT ALL CALCULATIONS APS PONS USINO THF FULL, UNPOUNPFP 
VALUES PFSULTINO FPOM THE PAW PATA UNLESS OTHERWISE STATEP. 
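As a rough check of the figures above, this sketch recomputes $R$ as the correlation between the measured sales and the fitted values. It assumes the bakery data from this chapter; the helper names are illustrative, and the coefficients are refit from the data rather than taken from the rounded equation.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Fit y = a1*x1 + a2*x2 + b with unrounded coefficients.
s11, s22, s12 = cross(floor, floor), cross(dist, dist), cross(floor, dist)
s1y, s2y = cross(floor, sales), cross(dist, sales)
det = s11 * s22 - s12 ** 2
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
b = mean(sales) - a1 * mean(floor) - a2 * mean(dist)
predicted = [a1 * x1 + a2 * x2 + b for x1, x2 in zip(floor, dist)]

# Multiple correlation coefficient: correlation of y with y-hat.
syy = cross(sales, sales)
r = cross(sales, predicted) / math.sqrt(syy * cross(predicted, predicted))

# Residual sum of squares S_e, used later in the chapter.
se = sum((y - yh) ** 2 for y, yh in zip(sales, predicted))
print(round(r, 4), round(r ** 2, 4), round(1 - se / syy, 4))
```

For a least-squares fit with an intercept, $R^2$ also equals $1 - S_e / S_{yy}$, which is why the last two printed values agree.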















NO, BUT .5 CAN AGAIN BE USED AS A LOWER LIMIT.

YOU CAN SIMPLIFY THE $R^2$ CALCULATION. I WON'T EXPLAIN THE WHOLE THING, BUT BASICALLY IT'S SOMETHING LIKE THIS.*

$$R^2 = (\text{multiple correlation coefficient})^2 = \frac{a_1 S_{1y} + a_2 S_{2y} + \cdots + a_p S_{py}}{S_{yy}}$$

* REFER TO PAGE 144 FOR AN EXPLANATION OF $S_{1y}, S_{2y}, \ldots, S_{py}$.

THE TROUBLE WITH $R^2$

BEFORE YOU GET TOO EXCITED, THERE'S SOMETHING YOU SHOULD KNOW...

THIS $R^2$ MIGHT BE MISLEADING.

HOW CAN THIS BE? WE DID THE CALCULATIONS PERFECTLY.

WELL, THE TROUBLE IS... EVERY TIME WE ADD A PREDICTOR VARIABLE $p$... $R^2$ GETS LARGER. GUARANTEED.

HUH?!

SUPPOSE WE ADD THE AGE OF THE SHOP MANAGER TO THE CURRENT DATA.

BEFORE ADDING ANOTHER VARIABLE, THE $R^2$ WAS .9452.


| Bakery | Floor area of the shop (tsubo) | Distance to the nearest station (meters) | Shop manager’s age (years) | Monthly sales (¥10,000) |
|---|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 42 | 469 |
| Terai Station Shop | 8 | 0 | 29 | 366 |
| Sone Shop | 8 | 200 | 33 | 371 |
| Hashimoto Station Shop | 5 | 200 | 41 | 208 |
| Kikyou Town Shop | 7 | 300 | 33 | 246 |
| Post Office Shop | 8 | 230 | 35 | 297 |
| Suidobashi Shop | 7 | 40 | 40 | 363 |
| Rokujo Station Shop | 9 | 0 | 46 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 44 | 198 |
| Misato Shop | 9 | 180 | 34 | 364 |

AFTER ADDING THIS VARIABLE...

(Figure: the predictor variables floor area of the shop, distance to the nearest station, and age of the shop manager, shown before and after age is added. Before, $R^2 = .9452$.)

...IT'S .9495!

AS YOU CAN SEE, IT'S LARGER.

BUT WHEN WE PLOT AGE VERSUS MONTHLY SALES, THERE IS NO PATTERN, SO...



(Scatter plot: monthly sales versus age of the shop manager for the ten shops. Correlation coefficient = .0368.)

YET DESPITE THAT, THE VALUE OF $R^2$ INCREASED.

THE AGE OF THE SHOP MANAGER HAS NOTHING TO DO WITH MONTHLY SALES!

NEVER FEAR. THE ADJUSTED COEFFICIENT OF DETERMINATION, AKA ADJUSTED $R^2$, WILL SAVE US!

WHAT? ANOTHER $R^2$?

ADJUSTED $R^2$

THE VALUE OF ADJUSTED $R^2$ ($\bar{R}^2$) CAN BE OBTAINED BY USING THIS FORMULA.

$$\bar{R}^2 = 1 - \frac{S_e / (\text{sample size} - \text{number of predictor variables} - 1)}{S_{yy} / (\text{sample size} - 1)}$$

MIU, COULD YOU FIND THE VALUE OF ADJUSTED $R^2$ WITH AND WITHOUT THE AGE OF THE SHOP MANAGER?

HOW ABOUT WHEN WE ALSO INCLUDE THE SHOP MANAGER'S AGE?

WE'VE ALREADY GOT $R^2$ FOR THAT, RIGHT?

YES, IT'S .9495.

PREDICTOR VARIABLES:
• FLOOR AREA
• DISTANCE
• MANAGER'S AGE

$R^2 = .9495$

WHAT ARE $S_{yy}$ AND $S_e$ IN THIS CASE?

$S_{yy}$ IS THE SAME AS BEFORE. IT'S 76199.6.

$$\bar{R}^2 = 1 - \frac{3846.4 / (10 - 3 - 1)}{76199.6 / (10 - 1)} = .9243$$

WAIT A MINUTE...

| PREDICTOR VARIABLES | $R^2$ | ADJUSTED $R^2$ |
|---|---|---|
| FLOOR AREA AND DISTANCE | .9452 | .9296 |
| FLOOR AREA, DISTANCE, AND AGE | .9495 | .9243 |

LOOK! THE VALUE OF ADJUSTED $R^2$ IS LARGER WHEN THE AGE OF THE SHOP MANAGER IS NOT INCLUDED.
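The comparison can be reproduced from the sums of squares alone. This is a small sketch assuming $S_{yy} = 76199.6$, $S_e = 3846.4$ for the three-predictor model, and $S_e = 4173.0$ for the two-predictor model (the value used in the ANOVA step of this chapter); the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(se, syy, sample_size, predictors):
    """Adjusted R^2 = 1 - (Se / (n - p - 1)) / (Syy / (n - 1))."""
    return 1 - (se / (sample_size - predictors - 1)) / (syy / (sample_size - 1))

syy = 76199.6

# Plain R^2 always improves when a predictor is added...
r2_without_age = 1 - 4173.0 / syy
r2_with_age = 1 - 3846.4 / syy

# ...but adjusted R^2 charges a penalty for each extra predictor.
adj_without_age = adjusted_r2(4173.0, syy, 10, 2)
adj_with_age = adjusted_r2(3846.4, syy, 10, 3)

print(round(r2_without_age, 4), round(r2_with_age, 4))
print(round(adj_without_age, 4), round(adj_with_age, 4))
```

The penalty comes from dividing $S_e$ by a smaller number of degrees of freedom each time a predictor is added, so a variable has to reduce $S_e$ by enough to pay for itself.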
HYPOTHESIS TESTING WITH MULTIPLE REGRESSION

SINCE WE'RE HAPPY WITH ADJUSTED $R^2$, WE'LL TEST OUR ASSUMPTIONS ABOUT THE POPULATION.

WE'LL DO HYPOTHESIS AND REGRESSION COEFFICIENT TESTS, RIGHT?

YES, BUT IN MULTIPLE REGRESSION ANALYSIS, WE HAVE PARTIAL REGRESSION COEFFICIENTS, INSTEAD.

DO YOU REMEMBER HOW WE DID THE HYPOTHESIS TESTING BEFORE?

I THINK SO. WE TESTED WHETHER THE POPULATION MATCHED THE EQUATION AND THEN CHECKED THAT $A$ DIDN'T EQUAL ZERO.

RIGHT! IT'S BASICALLY THE SAME WITH MULTIPLE REGRESSION.

ALTERNATIVE HYPOTHESIS

IF THE FLOOR AREA OF THE SHOP IS $x_1$ TSUBO AND THE DISTANCE TO THE NEAREST STATION IS $x_2$ METERS, THE MONTHLY SALES FOLLOW A NORMAL DISTRIBUTION WITH MEAN $A_1x_1 + A_2x_2 + B$ AND STANDARD DEVIATION $\sigma$.

NOW, WE HAVE MORE THAN ONE $x$ AND MORE THAN ONE $A$. AT LEAST ONE OF THESE $A$'S MUST NOT EQUAL ZERO.

STEP 4: CONDUCT THE ANALYSIS OF VARIANCE (ANOVA) TEST.

HERE ARE OUR ASSUMPTIONS ABOUT THE PARTIAL REGRESSION COEFFICIENTS. $A_1$, $A_2$, AND $B$ ARE COEFFICIENTS OF THE ENTIRE POPULATION.

THE EQUATION SHOULD REFLECT THE POPULATION.

IF THE REGRESSION EQUATION OBTAINED IS

$$y = a_1x_1 + a_2x_2 + b$$

• $A_1$ IS APPROXIMATELY $a_1$.
• $A_2$ IS APPROXIMATELY $a_2$.
• $B$ IS APPROXIMATELY $b$.
• $\sigma$ IS APPROXIMATELY $\sqrt{\dfrac{S_e}{\text{sample size} - \text{number of predictor variables} - 1}}$.

COULD YOU APPLY THIS TO KAZAMI BAKERY'S DATA?
ONE TESTS ALL THE PARTIAL REGRESSION COEFFICIENTS TOGETHER.

NULL HYPOTHESIS: $A_1 = 0$ AND $A_2 = 0$
ALTERNATIVE HYPOTHESIS: NOT $A_1 = A_2 = 0$

IN OTHER WORDS, ONE OF THE FOLLOWING IS TRUE:
• $A_1 \neq 0$ AND $A_2 \neq 0$
• $A_1 \neq 0$ AND $A_2 = 0$
• $A_1 = 0$ AND $A_2 \neq 0$

THE OTHER TESTS THE INDIVIDUAL PARTIAL REGRESSION COEFFICIENTS SEPARATELY.

NULL HYPOTHESIS: $A_i = 0$
ALTERNATIVE HYPOTHESIS: $A_i \neq 0$

SO, WE HAVE TO REPEAT THIS TEST FOR EACH OF THE PARTIAL REGRESSION COEFFICIENTS?

LET'S SET THE SIGNIFICANCE LEVEL TO .05. ARE YOU READY TO TRY DOING THESE TESTS?
FIRST, WE'LL TEST ALL THE PARTIAL REGRESSION COEFFICIENTS TOGETHER.

THE STEPS OF ANOVA

Step 1  Define the population.
        The population is all Kazami Bakery shops.

Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is $A_1 = 0$ and $A_2 = 0$.
        Alternative hypothesis is that $A_1$ or $A_2$ or both $\neq 0$.

Step 3  Select which hypothesis test to conduct.
        We’ll use an F-test.

Step 4  Choose the significance level.
        We’ll use a significance level of .05.

Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

        $$\frac{S_{yy} - S_e}{\text{number of predictor variables}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{76199.6 - 4173.0}{2} \div \frac{4173.0}{10 - 2 - 1} = 60.4$$

        The test statistic, 60.4, will follow an F distribution with first degree of freedom 2 (the number of predictor variables) and second degree of freedom 7 (sample size minus the number of predictor variables minus 1), if the null hypothesis is true.

Step 6  Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
        At significance level .05, with $d_1$ being 2 and $d_2$ being 7 (10 - 2 - 1), the critical value is 4.7374. Our test statistic is 60.4.

Step 7  Decide whether you can reject the null hypothesis.
        Since our test statistic is greater than the critical value, we reject the null hypothesis.
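The arithmetic in Step 5 through Step 7 can be sketched in a few lines. The sums of squares and the critical value 4.7374 are the ones quoted above; the variable names are illustrative, and the critical value is hard-coded rather than looked up from an F table.

```python
syy = 76199.6          # total sum of squares
se = 4173.0            # residual sum of squares
sample_size = 10
predictors = 2

# F statistic for testing all partial regression coefficients together.
d1 = predictors
d2 = sample_size - predictors - 1
f_stat = ((syy - se) / d1) / (se / d2)

critical_value = 4.7374   # F(2, 7) at significance level .05, from the text
reject_null = f_stat > critical_value

print(round(f_stat, 1), reject_null)
```

Comparing the statistic with the critical value is equivalent to comparing the p-value with the significance level, which is how Step 6 is worded.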
NEXT, WE'LL TEST THE INDIVIDUAL PARTIAL REGRESSION COEFFICIENTS. I WILL DO THIS FOR $A_1$ AS AN EXAMPLE.

THE STEPS OF ANOVA

Step 1  Define the population.
        The population is all Kazami Bakery shops.

Step 2  Set up a null hypothesis and an alternative hypothesis.
        Null hypothesis is $A_1 = 0$.
        Alternative hypothesis is $A_1 \neq 0$.

Step 3  Select which hypothesis test to conduct.
        We’ll use an F-test.

Step 4  Choose the significance level.
        We’ll use a significance level of .05.

Step 5  Calculate the test statistic from the sample data.
        The test statistic is:

        $$\frac{a_1^2}{S^{11}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1} = \frac{41.5^2}{0.0657} \div \frac{4173.0}{10 - 2 - 1} = 44$$

        The test statistic will follow an F distribution with first degree of freedom 1 and second degree of freedom 7 (sample size minus the number of predictor variables minus 1), if the null hypothesis is true. (The value of $S^{11}$ will be explained on the next page.)

Step 6  Determine whether the p-value for the test statistic obtained in Step 5 is smaller than the significance level.
        At significance level .05, with $d_1$ being 1 and $d_2$ being 7, the critical value is 5.5914. Our test statistic is 44.

Step 7  Decide whether you can reject the null hypothesis.
        Since our test statistic is greater than the critical value, we reject the null hypothesis.

REGARDLESS OF THE RESULT OF STEP 7, IF THE VALUE OF THE TEST STATISTIC

$$\frac{a_i^2}{S^{ii}} \div \frac{S_e}{\text{sample size} - \text{number of predictor variables} - 1}$$

IS 2 OR MORE, WE STILL CONSIDER THE PREDICTOR VARIABLE CORRESPONDING TO THAT PARTIAL REGRESSION COEFFICIENT TO BE USEFUL FOR PREDICTING THE OUTCOME VARIABLE.
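The single-coefficient test uses the same ingredients. Here is a sketch with the rounded values quoted in Step 5 ($a_1 = 41.5$, $S^{11} = 0.0657$, $S_e = 4173.0$); the function name is illustrative, and the critical value 5.5914 is taken from Step 6 rather than computed.

```python
def coefficient_f(a, s_inverse_diag, se, sample_size, predictors):
    """F statistic for one partial regression coefficient."""
    return (a ** 2 / s_inverse_diag) / (se / (sample_size - predictors - 1))

f_a1 = coefficient_f(41.5, 0.0657, 4173.0, sample_size=10, predictors=2)

critical_value = 5.5914          # F(1, 7) at significance level .05, from the text
reject_null = f_a1 > critical_value
useful_by_rule_of_thumb = f_a1 >= 2   # the "2 or more" guideline above

print(round(f_a1), reject_null, useful_by_rule_of_thumb)
```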
FINDING $S^{11}$ AND $S^{22}$

(Figure: the data matrix of floor space and distance to the nearest station, with an added column of 1s, and the inverse matrix computed from it.)

YOU NEED TO ADD A LINE WITH A 1 IN ALL ROWS AND COLUMNS.

THIS IS THE $S^{11}$ THAT APPEARED IN STEP 5: 0.0657.

THIS IS $S^{22}$: 0.00001.

WE USE A MATRIX TO FIND $S^{11}$ AND $S^{22}$. WE NEEDED $S^{11}$ TO CALCULATE THE TEST STATISTIC ON THE PREVIOUS PAGE, AND WE USE $S^{22}$ TO TEST OUR SECOND COEFFICIENT INDEPENDENTLY, IN THE SAME WAY.*

* SOME PEOPLE USE THE t DISTRIBUTION INSTEAD OF THE F DISTRIBUTION WHEN EXPLAINING THE "TEST OF PARTIAL REGRESSION COEFFICIENTS." YOUR FINAL RESULT WILL BE THE SAME NO MATTER WHICH METHOD YOU CHOOSE.
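One way to reproduce these two numbers is sketched below. It is an assumption of this sketch, not the layout drawn in the figure, that the 2 x 2 matrix of sums of squared deviations and cross-deviations of the two predictors is inverted directly; the helper names are illustrative.

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Matrix [[S11, S12], [S21, S22]] for floor space (1) and distance (2).
s11, s22, s12 = cross(floor, floor), cross(dist, dist), cross(floor, dist)

# Inverse of a 2 x 2 matrix: swap the diagonal, negate the off-diagonal,
# and divide by the determinant.
det = s11 * s22 - s12 * s12
inv11 = s22 / det     # S^11
inv22 = s11 / det     # S^22
inv12 = -s12 / det    # S^12 (equal to S^21)

print(round(inv11, 4), round(inv12, 4), round(inv22, 5))
```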
STEP 5: CALCULATE CONFIDENCE INTERVALS FOR THE POPULATION.

WHAT'S NEXT? WAS IT SOMETHING ABOUT CONFIDENCE?

IT STARTS OUT LIKE SIMPLE REGRESSION ANALYSIS. BUT THEN THE MAHALANOBIS DISTANCE* COMES IN, AND THINGS GET VERY COMPLICATED VERY QUICKLY.

MAHALA... WHAT?

WOW. DO YOU THINK WE CAN DO IT?

I KNOW WE CAN, BUT WE'LL BE HERE ALL NIGHT. WE COULD HAVE A SLUMBER PARTY.

* THE MATHEMATICIAN P.C. MAHALANOBIS INVENTED A WAY TO USE MULTIVARIATE DISTANCES TO COMPARE POPULATIONS.

WELL THEN, I GUESS WE'LL HAVE TO FIND OUT THE CONFIDENCE INTERVALS USING DATA ANALYSIS SOFTWARE.

THIS TIME IT'S OKAY, BUT YOU SHOULDN'T ALWAYS RELY ON COMPUTERS. DOING CALCULATIONS BY HAND HELPS YOU LEARN.

YOU ARE SUCH A JERK SOMETIMES!
STEP 6: MAKE A PREDICTION!

HERE IS THE DATA FOR THE NEW SHOP WE'RE PLANNING TO OPEN.

| | Floor space of the shop (tsubo) | Distance to the nearest station (meters) |
|---|---|---|
| Isebashi Shop | 10 | 110 |

$$y = 41.5x_1 - 0.3x_2 + 65.3 = 41.5 \times 10 - 0.3 \times 110 + 65.3 = 447.3^*$$

¥4,473,000 PER MONTH!

* THIS CALCULATION WAS MADE USING ROUNDED NUMBERS. IF YOU USE THE FULL, UNROUNDED NUMBERS, THE RESULT WILL BE 442.96.
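The gap between the rounded and unrounded predictions comes almost entirely from rounding the distance coefficient. A sketch, assuming the bakery data from this chapter and refitting the coefficients (helper names are illustrative):

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Unrounded coefficients from the normal equations.
s11, s22, s12 = cross(floor, floor), cross(dist, dist), cross(floor, dist)
s1y, s2y = cross(floor, sales), cross(dist, sales)
det = s11 * s22 - s12 ** 2
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
b = mean(sales) - a1 * mean(floor) - a2 * mean(dist)

# Isebashi Shop: 10 tsubo, 110 m from the nearest station.
rounded = 41.5 * 10 - 0.3 * 110 + 65.3
unrounded = a1 * 10 + a2 * 110 + b

print(round(rounded, 1), round(unrounded, 2))
```

The unrounded distance coefficient is close to -0.34, so over 110 meters the rounded -0.3 overstates sales by roughly 4.5 (in units of ¥10,000).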


YOU'RE A GENIUS, MIU! I SHOULD NAME THE SHOP AFTER YOU.

BUT HOW COULD WE KNOW THE EXACT SALES OF A SHOP THAT HASN'T BEEN BUILT? SHOULD WE CALCULATE A PREDICTION INTERVAL?

ABSOLUTELY.

IN SIMPLE REGRESSION ANALYSIS, THE METHOD TO FIND BOTH THE CONFIDENCE AND PREDICTION INTERVALS WAS SIMILAR. IS THAT ALSO TRUE FOR MULTIPLE REGRESSION ANALYSIS?
CHOOSING THE BEST COMBINATION OF PREDICTOR VARIABLES

NOW WHO'S BEING DRAMATIC?

JUST AS WITH SIMPLE REGRESSION ANALYSIS, WE CAN CALCULATE A MULTIPLE REGRESSION EQUATION USING ANY VARIABLES WE HAVE DATA ON, WHETHER OR NOT THEY ACTUALLY AFFECT THE OUTCOME VARIABLE.

LIKE THE AGE OF THE SHOP MANAGER? WE USED THAT, EVEN THOUGH IT DIDN'T HAVE ANY EFFECT ON SALES!

EXACTLY.

THE EQUATION BECOMES COMPLICATED IF YOU HAVE TOO MANY PREDICTOR VARIABLES.

THE BEST MULTIPLE REGRESSION EQUATION BALANCES ACCURACY AND COMPLEXITY BY INCLUDING ONLY THE PREDICTOR VARIABLES NEEDED TO MAKE THE BEST PREDICTION.

(Figure: a long equation, $y = a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5 + a_6x_6 + \cdots$, contrasted with a short one on scales running from difficult to easy and from accurate to not accurate.)

SHORT IS SWEET.

THERE ARE SEVERAL WAYS TO FIND THE EQUATION THAT GIVES YOU THE MOST BANG FOR YOUR BUCK.

• FORWARD SELECTION
• BACKWARD ELIMINATION
• FORWARD-BACKWARD STEPWISE SELECTION
• ASK A DOMAIN EXPERT WHICH VARIABLES ARE THE MOST IMPORTANT

THESE ARE SOME COMMON WAYS.

THE METHOD WE'LL USE TODAY IS SIMPLER THAN ANY OF THOSE. IT'S CALLED BEST SUBSETS REGRESSION, OR SOMETIMES, THE ROUND-ROBIN METHOD.

WHAT THE HECK IS THAT?

I'LL SHOW YOU. SUPPOSE $x_1$, $x_2$, AND $x_3$ ARE POTENTIAL PREDICTOR VARIABLES.

FIRST, WE'D CALCULATE THE MULTIPLE REGRESSION EQUATION FOR EVERY COMBINATION OF PREDICTOR VARIABLES!

• $x_1$
• $x_2$
• $x_3$
• $x_1$ AND $x_2$
• $x_2$ AND $x_3$
• $x_1$ AND $x_3$
• $x_1$ AND $x_2$ AND $x_3$

HAHA. THIS SURE IS ROUNDABOUT.
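The round-robin idea is easy to sketch: enumerate every non-empty subset of the candidate predictors, fit each one, and compare adjusted $R^2$. The code below is an illustration in plain Python using the bakery data, including the shop manager's age; the function names and the choice of adjusted $R^2$ as the single ranking criterion are assumptions of this sketch.

```python
from itertools import combinations

data = {
    "floor": [10, 8, 8, 5, 7, 8, 7, 9, 6, 9],
    "distance": [80, 0, 200, 200, 300, 230, 40, 0, 330, 180],
    "age": [42, 29, 33, 41, 33, 35, 40, 46, 44, 34],
}
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def adjusted_r2(names):
    """Fit sales on the named predictors and return adjusted R^2."""
    n, p = len(sales), len(names)
    ybar = sum(sales) / n
    ydev = [v - ybar for v in sales]
    devs = []
    for name in names:
        col = data[name]
        m = sum(col) / n
        devs.append([v - m for v in col])
    # Normal equations written with sums of cross-deviations.
    sxx = [[sum(u * v for u, v in zip(di, dj)) for dj in devs] for di in devs]
    sxy = [sum(u * v for u, v in zip(di, ydev)) for di in devs]
    coef = solve(sxx, sxy)
    syy = sum(v * v for v in ydev)
    se = syy - sum(a * s for a, s in zip(coef, sxy))
    return 1 - (se / (n - p - 1)) / (syy / (n - 1))

results = {
    names: adjusted_r2(names)
    for size in range(1, len(data) + 1)
    for names in combinations(data, size)
}
best = max(results, key=results.get)
for names, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(", ".join(names), round(value, 4))
```

With three candidates there are only seven subsets, but the count doubles with every added variable, which is why the other selection methods in the list exist.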








































LET'S REPLACE $x_1$, $x_2$, AND $x_3$ WITH SHOP SIZE, DISTANCE TO A STATION, AND MANAGER'S AGE.

IS OUR EQUATION THE WINNER?

WE'LL MAKE A TABLE THAT SHOWS THE PARTIAL REGRESSION COEFFICIENTS AND ADJUSTED $R^2$
ASSESSING POPULATIONS WITH
MULTIPLE REGRESSION ANALYSIS

Let’s review the procedure of multiple regression analysis, shown 
on page 112. 

1. Draw a scatter plot of each predictor variable and the outcome 
variable to see if they appear to be related. 

2. Calculate the multiple regression equation. 

3. Examine the accuracy of the multiple regression equation. 

4. Conduct the analysis of variance (ANOVA) test. 

5. Calculate confidence intervals for the population. 

6. Make a prediction! 

As in Chapter 2, we’ve talked about Steps 1 through 6 as if they 
were all mandatory. In reality, Steps 4 and 5 can be skipped for the 
analysis of some data sets. 

Kazami Bakery currently has only 10 stores, and of those
10 stores, only one (Yumenooka Shop) has a floor area of 10 tsubo¹
and is 80 m from the nearest station. However, Risa calculated a
confidence interval for the population of stores that were 10 tsubo
and 80 m from a station. Why would she do that?

Well, it’s possible that Kazami Bakery could open another 
10-tsubo store that’s also 80 m from a train station. If the chain 
keeps growing, there could be dozens of Kazami shops that fit that 
description. When Risa did that analysis, she was assuming that 
more 10-tsubo stores 80 m from a station might open someday. 

The usefulness of this assumption is disputable. Yumenooka
Shop has more sales than any other shop, so maybe the Kazami
family will decide to open more stores just like that one. However,
the bakery’s next store, Isebashi Shop, will be 10 tsubo but 110 m
from a station. In fact, it probably wasn’t necessary to analyze such
a specific population of stores. Risa could have skipped from
calculating adjusted $R^2$ to making the prediction, but being a good
friend, she wanted to show Miu all the steps.


1. Remember that 1 tsubo is about 36 square feet. 
STANDARDIZED RESIDUALS

As in simple regression analysis, we calculate standardized
residuals in multiple regression analysis when assessing how well the
equation fits the actual sample data that’s been collected.

Table 3-1 shows the residuals and standardized residuals for
the Kazami Bakery data used in this chapter. An example
calculation is shown for the Misato Shop.


TABLE 3-1: STANDARDIZED RESIDUALS OF THE KAZAMI BAKERY EXAMPLE

| Bakery | Floor area of the shop $x_1$ | Distance to the nearest station $x_2$ | Monthly sales $y$ | Monthly sales $\hat{y} = 41.5x_1 - 0.3x_2 + 65.3$ | Residual $y - \hat{y}$ | Standardized residual |
|---|---|---|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 469 | 453.2 | 15.8 | 0.8 |
| Terai Station Shop | 8 | 0 | 366 | 397.4 | -31.4 | -1.6 |
| Sone Shop | 8 | 200 | 371 | 329.3 | 41.7 | 1.8 |
| Hashimoto Station Shop | 5 | 200 | 208 | 204.7 | 3.3 | 0.2 |
| Kikyou Town Shop | 7 | 300 | 246 | 253.7 | -7.7 | -0.4 |
| Post Office Shop | 8 | 230 | 297 | 319.0 | -22.0 | -1.0 |
| Suidobashi Station Shop | 7 | 40 | 363 | 342.3 | 20.7 | 1.0 |
| Rokujo Station Shop | 9 | 0 | 436 | 438.9 | -2.9 | -0.1 |
| Wakaba Riverside Shop | 6 | 330 | 198 | 201.9 | -3.9 | -0.2 |
| Misato Shop | 9 | 180 | 364 | 377.6 | -13.6 | -0.6 |


If a residual is positive, the measurement is higher than
predicted by our equation, and if the residual is negative, the
measurement is lower than predicted; if it’s 0, the measurement and our
prediction are the same. The absolute value of the residual tells us
how well the equation predicted what actually happened. The larger
the absolute value, the greater the difference between the
measurement and the prediction.

If the absolute value of the standardized residual is greater
than 3, the data point can be considered an outlier. Outliers are
measurements that don’t follow the general trend. In this case,
an outlier could be caused by a store closure, by road construction
around a store, or by a big event held at one of the bakeries:
anything that would significantly affect sales. When you detect an
outlier, you should investigate the data point to see if it needs to be
removed and the regression equation calculated again.
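The residual columns of Table 3-1 and the outlier check can be sketched as follows. This section does not spell out the standardizing formula, so the code assumes the usual leverage-adjusted form, residual divided by $\sqrt{\frac{S_e}{n - p - 1}\,(1 - h)}$ with $h = \frac{1}{n} + \frac{D^2}{n - 1}$; that assumption reproduces the rounded values in the table. Helper names are illustrative.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]
n, p = len(sales), 2

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Fit the equation with unrounded coefficients.
s11, s22, s12 = cross(floor, floor), cross(dist, dist), cross(floor, dist)
s1y, s2y = cross(floor, sales), cross(dist, sales)
det = s11 * s22 - s12 ** 2
a1 = (s22 * s1y - s12 * s2y) / det
a2 = (s11 * s2y - s12 * s1y) / det
b = mean(sales) - a1 * mean(floor) - a2 * mean(dist)

residuals = [y - (a1 * x1 + a2 * x2 + b)
             for x1, x2, y in zip(floor, dist, sales)]
se = sum(e * e for e in residuals)

# Elements of the inverse matrix (S^11, S^12, S^22).
inv11, inv12, inv22 = s22 / det, -s12 / det, s11 / det

def leverage(x1, x2):
    """h = 1/n + D^2/(n - 1), assumed form (see lead-in)."""
    d1, d2 = x1 - mean(floor), x2 - mean(dist)
    return 1 / n + d1 * d1 * inv11 + 2 * d1 * d2 * inv12 + d2 * d2 * inv22

standardized = [
    e / math.sqrt(se / (n - p - 1) * (1 - leverage(x1, x2)))
    for e, x1, x2 in zip(residuals, floor, dist)
]

# Flag any shop whose standardized residual exceeds 3 in absolute value.
outliers = [i for i, s in enumerate(standardized) if abs(s) > 3]
print([round(s, 1) for s in standardized], outliers)
```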


MAHALANOBIS DISTANCE

The Mahalanobis distance was introduced in 1936 by
mathematician and scientist P.C. Mahalanobis, who also founded the
Indian Statistical Institute. Mahalanobis distance is very useful
in statistics because it considers an entire set of data, rather than
looking at each measurement in isolation. It’s a way of calculating
distance that, unlike the more common Euclidean concept of
distance, takes into account the correlation between measurements
to determine the similarity of a sample to an established data set.
Because these calculations reflect a more complex relationship,
a linear equation will not suffice. Instead, we use matrices, which
condense a complex array of information into a more manageable
form that can then be used to calculate all of these distances
at once.

On page 137, Risa used her computer to find the prediction
interval using the Mahalanobis distance. Let’s work through that
calculation now and see how she arrived at a prediction interval
from ¥3,751,000 to ¥5,109,000 at a confidence level of 95%.

STEP 1

Obtain the inverse matrix of

$$\begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}, \text{ which is } \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1p} \\ S_{21} & S_{22} & \cdots & S_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S_{p1} & S_{p2} & \cdots & S_{pp} \end{pmatrix}^{-1} = \begin{pmatrix} S^{11} & S^{12} & \cdots & S^{1p} \\ S^{21} & S^{22} & \cdots & S^{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S^{p1} & S^{p2} & \cdots & S^{pp} \end{pmatrix}$$

The first matrix is the covariance matrix as calculated on
page 132. The diagonal of this matrix ($S_{11}$, $S_{22}$, and so on) is the
variance within a certain variable.

The inverse of this matrix, the second and third matrices shown
here, is also known as the concentration matrix for the different
predictor variables: floor area and distance to the nearest station.

For example, $S_{22}$ is the variance of the values of the distance
to the nearest station. $S_{25}$ would be the covariance of the distance to
the nearest station and some fifth predictor variable.

The values of $S^{11}$ and $S^{22}$ on page 132 were obtained through
this series of calculations.

The values of $S^{ii}$ and $S^{ij}$ in

$$\begin{pmatrix} S^{11} & S^{12} & \cdots & S^{1p} \\ S^{21} & S^{22} & \cdots & S^{2p} \\ \vdots & \vdots & \ddots & \vdots \\ S^{p1} & S^{p2} & \cdots & S^{pp} \end{pmatrix}$$

and the values of $S^{ii}$ and $S^{ij}$ obtained from conducting individual
tests of the partial regression coefficients are always the same.
That is, the values of $S^{ii}$ and $S^{ij}$ found through partial regression
will be equivalent to the values of $S^{ii}$ and $S^{ij}$ found by calculating
the inverse matrix.

STEP 2

Next we need to calculate the square of Mahalanobis distance for a
given point using the following equation:

$$D_M^2(x) = (x - \bar{x})^T \left(S^{-1}\right) (x - \bar{x})$$

The $x$ values are taken from the predictors, $\bar{x}$ is the mean of
a given set of predictors, and $S^{-1}$ is the concentration matrix from
Step 1. The Mahalanobis distance for the shop at Yumenooka is
shown here:

$$D^2 = \left\{ \begin{aligned} &(x_1 - \bar{x}_1)(x_1 - \bar{x}_1)S^{11} + (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)S^{12} + \cdots + (x_1 - \bar{x}_1)(x_p - \bar{x}_p)S^{1p} \\ &+ (x_2 - \bar{x}_2)(x_1 - \bar{x}_1)S^{21} + (x_2 - \bar{x}_2)(x_2 - \bar{x}_2)S^{22} + \cdots + (x_2 - \bar{x}_2)(x_p - \bar{x}_p)S^{2p} \\ &\;\;\vdots \\ &+ (x_p - \bar{x}_p)(x_1 - \bar{x}_1)S^{p1} + (x_p - \bar{x}_p)(x_2 - \bar{x}_2)S^{p2} + \cdots + (x_p - \bar{x}_p)(x_p - \bar{x}_p)S^{pp} \end{aligned} \right\} \times (\text{number of individuals} - 1)$$

$$D^2 = \left\{ \begin{aligned} &(x_1 - \bar{x}_1)(x_1 - \bar{x}_1)S^{11} + (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)S^{12} \\ &+ (x_2 - \bar{x}_2)(x_1 - \bar{x}_1)S^{21} + (x_2 - \bar{x}_2)(x_2 - \bar{x}_2)S^{22} \end{aligned} \right\} \times (\text{number of individuals} - 1)$$

$$= \left\{ \begin{aligned} &(10 - 7.7)(10 - 7.7) \times 0.0657 + (10 - 7.7)(80 - 156) \times 0.0004 \\ &+ (80 - 156)(10 - 7.7) \times 0.0004 + (80 - 156)(80 - 156) \times 0.00001 \end{aligned} \right\} \times (10 - 1) = 2.4$$
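The Step 2 arithmetic can be sketched directly from the data. The code assumes the bakery data from this chapter and inverts the 2 x 2 matrix of sums of squared deviations, as in Step 1; the function name `mahalanobis_sq` is illustrative.

```python
floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
n = len(floor)

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Inverse of [[S11, S12], [S21, S22]].
s11, s22, s12 = cross(floor, floor), cross(dist, dist), cross(floor, dist)
det = s11 * s22 - s12 ** 2
inv11, inv12, inv22 = s22 / det, -s12 / det, s11 / det

def mahalanobis_sq(x1, x2):
    """Squared Mahalanobis distance of (x1, x2) from the predictor means."""
    d1, d2 = x1 - mean(floor), x2 - mean(dist)
    quad = d1 * d1 * inv11 + 2 * d1 * d2 * inv12 + d2 * d2 * inv22
    return quad * (n - 1)

d2_yumenooka = mahalanobis_sq(10, 80)    # existing shop
d2_isebashi = mahalanobis_sq(10, 110)    # planned shop
print(round(d2_yumenooka, 1), round(d2_isebashi, 2))
```

The two middle terms of the written-out formula are equal because the inverse matrix is symmetric, which is why the code doubles a single cross term.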
STEP 3

Now we can calculate the confidence interval, as illustrated here:

(Figure: the confidence interval for monthly sales, running from 453 - 35 = 418 to 453 + 35 = 488 and centered on $a_1 \times 10 + a_2 \times 80 + b = 41.5 \times 10 - 0.3 \times 80 + 65.3 = 453$.)

The minimum value of the confidence interval is the same distance from
the mean as the maximum value of the interval. In other words, the confidence
interval “straddles” the mean equally on each side. We calculate the distance
from the mean as shown below ($D^2$ stands for the squared Mahalanobis distance,
and $x$ represents the total number of predictors, not a value of some predictor):

$$\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left( \frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1} \right) \times \frac{S_e}{\text{sample size} - x - 1}}$$

$$= \sqrt{F(1, 10 - 2 - 1; .05) \times \left( \frac{1}{10} + \frac{2.4}{10 - 1} \right) \times \frac{4173.0}{10 - 2 - 1}} = 35$$

As with simple regression analysis, when obtaining the prediction
interval, we add 1 to the second term:

$$\sqrt{F(1, \text{sample size} - x - 1; .05) \times \left( 1 + \frac{1}{\text{sample size}} + \frac{D^2}{\text{sample size} - 1} \right) \times \frac{S_e}{\text{sample size} - x - 1}}$$

If the confidence rate is 99%, just change the .05 to .01:

$$F(1, \text{sample size} - x - 1; .05) = F(1, 10 - 2 - 1; .05) = 5.6$$
$$F(1, \text{sample size} - x - 1; .01) = F(1, 10 - 2 - 1; .01) = 12.2$$

You can see that if you want to be more confident that the
prediction interval will include the actual outcome, the interval needs
to be larger.
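Both interval formulas differ only in the extra 1, so one helper covers them. The sketch below uses numbers quoted in this chapter ($S_e = 4173.0$, $D^2 = 2.4$ for Yumenooka, the F values 5.5914 and 12.2, the rounded inverse elements 0.0657, 0.0004, and 0.00001, and the unrounded Isebashi prediction 442.96); the function name `half_width` is illustrative.

```python
import math

sample_size, predictors = 10, 2
se = 4173.0
f_05 = 5.5914   # F(1, 7; .05), quoted in the coefficient test
f_01 = 12.2     # F(1, 7; .01), quoted above

def half_width(f_crit, d_squared, prediction=False):
    """Distance from the center of the interval to either end."""
    extra = 1.0 if prediction else 0.0
    spread = extra + 1 / sample_size + d_squared / (sample_size - 1)
    return math.sqrt(f_crit * spread * se / (sample_size - predictors - 1))

# Confidence interval for shops like Yumenooka (10 tsubo, 80 m).
ci_95 = half_width(f_05, 2.4)
ci_99 = half_width(f_01, 2.4)

# Prediction interval for Isebashi (10 tsubo, 110 m), with D^2 built
# from the rounded inverse-matrix elements.
d1, d2 = 10 - 7.7, 110 - 156
d_sq_new = (d1 * d1 * 0.0657 + 2 * d1 * d2 * 0.0004
            + d2 * d2 * 0.00001) * (sample_size - 1)
pi_95 = half_width(f_05, d_sq_new, prediction=True)
low, high = 442.96 - pi_95, 442.96 + pi_95

print(round(ci_95), round(ci_99))
print(round(low, 1), round(high, 1))
```

In units of ¥10,000, the last line lands close to the ¥3,751,000 to ¥5,109,000 interval mentioned earlier; small differences come from using rounded inputs.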
USING CATEGORICAL DATA IN MULTIPLE REGRESSION ANALYSIS

Recall from Chapter 1 that categorical data is data that can’t be
measured with numbers. For example, the color of a store manager’s
eyes is categorical (and probably a terrible predictor variable for
monthly sales). Although categorical variables can be represented
by numbers (1 = blue, 2 = green), they are discrete: there’s no such
thing as “green and a half.” Also, one cannot say that 2 (green eyes)
is greater than 1 (blue eyes). So far we’ve been using the numerical
data shown in Table 3-2, which also appears on page 113. Numerical
data can be meaningfully represented by continuous numerical
values: 110 m from the train station is farther than 109.9 m.


TABLE 3-2: KAZAMI BAKERY EXAMPLE DATA

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Monthly sales (¥10,000) |
|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 469 |
| Terai Station Shop | 8 | 0 | 366 |
| Sone Shop | 8 | 200 | 371 |
| Hashimoto Station Shop | 5 | 200 | 208 |
| Kikyou Town Shop | 7 | 300 | 246 |
| Post Office Shop | 8 | 230 | 297 |
| Suidobashi Station Shop | 7 | 40 | 363 |
| Rokujo Station Shop | 9 | 0 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 198 |
| Misato Shop | 9 | 180 | 364 |


The predictor variable floor area is measured in tsubo, distance
to the nearest station in meters, and monthly sales in yen. Clearly,
these are all numerically measurable. In multiple regression
analysis, the outcome variable must be a measurable, numerical variable,
but the predictor variables can be

• all numerical variables,

• some numerical and some categorical variables, or

• all categorical variables.

Tables 3-3 and 3-4 both show valid data sets. In the first, 
categorical and numerical variables are both present, and in the 
second, all of the predictor variables are categorical. 
TABLE 3-3: A COMBINATION OF CATEGORICAL AND NUMERICAL DATA

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Free samples | Monthly sales (¥10,000) |
|---|---|---|---|---|
| Yumenooka Shop | 10 | 80 | 1 | 469 |
| Terai Station Shop | 8 | 0 | 0 | 366 |
| Sone Shop | 8 | 200 | 1 | 371 |
| Hashimoto Station Shop | 5 | 200 | 0 | 208 |
| Kikyou Town Shop | 7 | 300 | 0 | 246 |
| Post Office Shop | 8 | 230 | 0 | 297 |
| Suidobashi Station Shop | 7 | 40 | 0 | 363 |
| Rokujo Station Shop | 9 | 0 | 1 | 436 |
| Wakaba Riverside Shop | 6 | 330 | 0 | 198 |
| Misato Shop | 9 | 180 | 1 | 364 |


In Table 3-3 we’ve included the categorical predictor variable
free samples. Some Kazami Bakery locations put out a tray of free
samples (1), and others don’t (0). When we include this data in the
analysis, we get the multiple regression equation

$$y = 30.6x_1 - 0.4x_2 + 39.5x_3 + 135.9$$

where $y$ represents monthly sales, $x_1$ represents floor area, $x_2$
represents distance to the nearest station, and $x_3$ represents free
samples.
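A 0/1 variable like this acts as a switch: its coefficient is the amount added to the prediction when the switch is on. The sketch below simply evaluates the equation above for a hypothetical shop with and without free samples; the function name `predict_sales` is illustrative.

```python
def predict_sales(floor_area, distance, free_samples):
    """Evaluate y = 30.6*x1 - 0.4*x2 + 39.5*x3 + 135.9.

    free_samples must be 1 (offers samples) or 0 (does not).
    """
    return 30.6 * floor_area - 0.4 * distance + 39.5 * free_samples + 135.9

# Same floor area and distance, with and without a tray of samples.
with_samples = predict_sales(8, 200, 1)
without_samples = predict_sales(8, 200, 0)

print(round(with_samples, 1), round(without_samples, 1))
print(round(with_samples - without_samples, 1))
```

Whatever floor area and distance are chosen, the difference between the two predictions is the coefficient of $x_3$, 39.5 (about ¥395,000 per month).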


TABLE 3-4: CATEGORICAL PREDICTOR DATA ONLY

| Bakery | Floor space of the shop (tsubo) | Distance to the nearest station (meters) | Samples every day | Samples on the weekend only | Monthly sales (¥10,000) |
|---|---|---|---|---|---|
| Yumenooka Shop | 1 | 0 | 1 | 0 | 469 |
| Terai Station Shop | 1 | 0 | 0 | 0 | 366 |
| Sone Shop | 1 | 1 | 1 | 0 | 371 |
| Hashimoto Station Shop | 0 | 1 | 0 | 0 | 208 |
| Kikyou Town Shop | 0 | 1 | 0 | 0 | 246 |
| Post Office Shop | 1 | 1 | 0 | 0 | 297 |
| Suidobashi Station Shop | 0 | 0 | 0 | 0 | 363 |
| Rokujo Station Shop | 1 | 0 | 1 | 1 | 436 |
| Wakaba Riverside Shop | 0 | 1 | 0 | 0 | 198 |
| Misato Shop | 1 | 0 | 1 | 1 | 364 |

Floor space: less than 8 tsubo = 0, 8 tsubo or more = 1.
Distance: less than 200 m = 0, 200 m or more = 1.
Samples: does not offer samples = 0, offers samples = 1.








In Table 3-4, we’ve converted numerical data (floor space and
distance to a station) to categorical data by creating some general
categories. Using this data, we calculate the multiple regression
equation

$$y = 50.2x_1 - 110.1x_2 + 13.4x_3 + 75.1x_4 + 336.4$$

where $y$ represents monthly sales, $x_1$ represents floor area, $x_2$
represents distance to the nearest station, $x_3$ represents samples
every day, and $x_4$ represents samples on the weekend only.


MULTICOLLINEARITY 

Multicollinearity occurs when two or more of the predictor variables
are strongly correlated with each other. When this happens, it’s hard to
distinguish between the effects of these variables on the outcome
variable, and this can have the following effects on your analysis:

• Less accurate estimate of the impact of a given variable on the
outcome variable

• Unusually large standard errors of the regression coefficients

• Failure to reject the null hypothesis

• Overfitting, which means that the regression equation describes
a relationship between the outcome variable and random error,
rather than the predictor variable

The presence of multicollinearity can be assessed by using an
index such as tolerance or the inverse of tolerance, known as the
variance inflation factor (VIF). Generally, a tolerance of less than
0.1 or a VIF greater than 10 is thought to indicate significant
multicollinearity, but sometimes more conservative thresholds are used.
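For a feel of the scale of these indexes, here is a sketch for the two bakery predictors. It assumes the common definition of tolerance as $1 - R_j^2$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors; with only two predictors, that is one minus their squared correlation. Helper names are illustrative.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Correlation between floor space and distance to the nearest station.
r = cross(floor, dist) / math.sqrt(cross(floor, floor) * cross(dist, dist))

tolerance = 1 - r ** 2      # share of one predictor not explained by the other
vif = 1 / tolerance         # variance inflation factor

print(round(r, 3), round(tolerance, 3), round(vif, 2))
print("multicollinearity concern:", tolerance < 0.1 or vif > 10)
```

With more than two predictors, each $R_j^2$ would come from its own multiple regression, so every predictor gets its own tolerance and VIF.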

When you’re just starting out with multiple regression analysis,
you don’t need to worry too much about this. Just keep in mind
that multicollinearity can cause problems when it’s severe.
Therefore, when predictor variables are correlated to each other strongly,
it may be better to remove one of the highly correlated variables
and then reanalyze the data.


DETERMINING THE RELATIVE INFLUENCE OF PREDICTOR 
VARIABLES ON THE OUTCOME VARIABLE 

Some people use multiple regression analysis to examine the
relative influence of each predictor variable on the outcome variable.
This is a fairly common and accepted use of multiple regression
analysis, but it’s not always a wise use.

The story below illustrates how one researcher used multiple
regression analysis to assess the relative impact of various factors
on the overall satisfaction of people who bought a certain type of
candy.

Mr. Torikoshi is a product development researcher in a
confectionery company. He recently developed a new soda-flavored candy,
Magic Fizz, that fizzes when wet. The candy is selling astonishingly
well. To find out what makes it so popular, the company gave free
samples of the candy to students at the local university and asked
them to rate the product using the following questionnaire.


MAGIC FIZZ QUESTIONNAIRE

Please let us know what you thought of Magic Fizz by
answering the following questions. Circle the answer that
best represents your opinion.

Flavor                 1. Unsatisfactory   2. Satisfactory   3. Exceptional
Texture                1. Unsatisfactory   2. Satisfactory   3. Exceptional
Fizz sensation         1. Unsatisfactory   2. Satisfactory   3. Exceptional
Package design         1. Unsatisfactory   2. Satisfactory   3. Exceptional
Overall satisfaction   1. Unsatisfactory   2. Satisfactory   3. Exceptional


Twenty students returned the questionnaires, and the results
are compiled in Table 3-5. Note that unlike in the Kazami Bakery
example, the values of the outcome variable (overall satisfaction)
are already known. In the bakery problem, the goal was to predict
the outcome variable (monthly sales) of a not-yet-existing store based
on the trends shown by existing stores. In this case, the purpose of this
analysis is to examine the relative effects of the different predictor
variables in order to learn how each of the predictors (flavor,
texture, sensation, design) affects the outcome (satisfaction).
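TABLE 3-5: RESULTS OF MAGIC FIZZ QUESTIONNAIRE

| Respondent | Flavor | Texture | Fizz sensation | Package design | Overall satisfaction |
|---|---|---|---|---|---|
| 1 | 2 | 2 | 3 | 2 | 2 |
| 2 | 1 | 1 | 3 | 1 | 3 |
| 3 | 2 | 2 | 1 | 1 | 3 |
| 3 | 2 | 2 | 1 | 1 | 1 |
| 4 | 3 | 3 | 3 | 2 | 2 |
| 5 | 1 | 1 | 2 | 2 | 1 |
| 6 | 1 | 1 | 1 | 1 | 1 |
| 7 | 3 | 3 | 1 | 3 | 3 |
| 8 | 3 | 3 | 1 | 2 | 2 |
| 9 | 3 | 3 | 1 | 2 | 3 |
| 10 | 1 | 1 | 3 | 1 | 1 |
| 11 | 2 | 3 | 2 | 1 | 3 |
| 12 | 2 | 1 | 1 | 1 | 1 |
| 13 | 3 | 3 | 3 | 1 | 3 |
| 14 | 3 | 3 | 1 | 3 | 3 |
| 15 | 3 | 2 | 1 | 1 | 2 |
| 16 | 1 | 1 | 3 | 3 | 1 |
| 17 | 2 | 2 | 2 | 1 | 1 |
| 18 | 1 | 1 | 1 | 3 | 1 |
| 19 | 3 | 1 | 3 | 3 | 3 |
| 20 | 3 | 3 | 3 | 3 | 3 |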


Each of the variables was normalized before the multiple
regression equation was calculated. Normalization reduces the
effect of differences in scale, allowing a researcher to compare
variables more accurately. The resulting equation is

$$y = 0.41x_1 + 0.32x_2 + 0.26x_3 + 0.11x_4$$

where $y$ represents overall satisfaction, $x_1$ represents flavor, $x_2$
represents texture, $x_3$ represents fizz sensation, and $x_4$ represents
package design.
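To see what normalizing does to the coefficients, the sketch below uses the smaller bakery data set rather than the questionnaire data. It assumes that "normalized" means converting each variable to z-scores (mean 0, standard deviation 1), which is consistent with an equation that has no intercept; the helper names are illustrative.

```python
import math

floor = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]
dist = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def zscores(v):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    m = mean(v)
    sd = math.sqrt(cross(v, v) / (len(v) - 1))
    return [(a - m) / sd for a in v]

def fit_two(x1, x2, y):
    """Return (a1, a2, b) for y = a1*x1 + a2*x2 + b."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return a1, a2, mean(y) - a1 * mean(x1) - a2 * mean(x2)

raw = fit_two(floor, dist, sales)
std = fit_two(zscores(floor), zscores(dist), zscores(sales))

print([round(c, 2) for c in raw])   # depends on tsubo, meters, and yen
print([round(c, 3) for c in std])   # unit-free, intercept is 0
```

In raw units the distance coefficient looks tiny next to the floor-area coefficient only because meters are small units; after normalization the two are on the same footing.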

If you compare the partial regression coefficients for the four
predictor variables, you can see that the coefficient for flavor is the
largest. Based on that fact, Mr. Torikoshi concluded that the flavor
has the strongest influence on overall satisfaction.

Mr. Torikoshi’s reasoning does make sense. The outcome
variable is equal to the sum of the predictor variables multiplied by
their partial regression coefficients. If you multiply a predictor
variable by a higher number, it should have a greater impact on the final
tally, right? Well, sometimes, but it’s not always so simple.
Let’s take a closer look at Mr. Torikoshi’s reasoning as
depicted here: 



In other words, he is assuming that all the variables relate 
independently and directly to overall satisfaction. However, this is 
not necessarily true. Maybe in reality, the texture influences how 
satisfied people are with the flavor, like this: 



Structural equation modeling (SEM) is a better method for 
comparing the relative impact of various predictor variables on an 
outcome. This approach makes more flexible assumptions than 
linear regression does, and it can even be used to analyze data sets 
with multicollinearity. However, SEM is not a cure-all. It relies on
the assumption that the data is relevant to answering the question
asked.

SEM also assumes that the data is correctly modeled. It’s worth 
noting that the questions in this survey ask each reviewer for a 
subjective interpretation. If Miu gave the candy two “satisfactory”
and two “exceptional” marks, she might rate her overall satisfaction
as either “satisfactory” or “exceptional.” Which rating she
picks might come down to what mood she is in that day! 

Risa could rate the four primary categories the same as Miu, 
give a different overall satisfaction rating from Miu, and still be 
confident that she is giving an unbiased review. Because Miu and 
Risa had different thoughts on the final category, our data may 
not be correctly modeled. However, structural equation modeling 
can still yield useful results by telling us which variables have an 
impact on other variables rather than the final outcome. 
WE USED SIMPLE
REGRESSION ANALYSIS
(AND MULTIPLE
REGRESSION ANALYSIS)
TO PREDICT THE VALUE OF
AN OUTCOME VARIABLE.

REMEMBER HOW WE
PREDICTED THE NUMBER
OF ICED TEA SALES?



BINOMIAL LOGISTIC
REGRESSION ANALYSIS
IS A LITTLE DIFFERENT.


SO WHAT'S IT
USED FOR?



THE PROBABILITY OF JOHN
GETTING ADMITTED TO
HARVARD UNIVERSITY

THE PROBABILITY OF
WINNING THE LOTTERY



IT'S USED
TO PREDICT
PROBABILITIES:
WHETHER
OR NOT
SOMETHING
WILL HAPPEN!



YOU GOT IT.

PROBABILITIES ARE
EXPRESSED AS A
PERCENTAGE, OR
EQUIVALENTLY AS A
VALUE BETWEEN
ZERO AND 1.



[Figure: the logistic regression equation with its parts labeled: outcome variable (y), predictor variable (x), regression coefficient, and intercept]
NO MATTER WHAT
z IS, THE VALUE OF
y IS NEVER GREATER
THAN 1 OR LESS
THAN ZERO.



YEAH! IT LOOKS
LIKE THE S WAS
SMOOSHED
TO FIT.
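
The bounded S-shape described above can be checked numerically. The following is a minimal sketch, assuming the logistic function $y = 1/(1 + e^{-z})$ that the text gives later in the chapter; the function name `logistic` and the sample values of z are illustrative choices, not taken from the book.

```python
import math


def logistic(z: float) -> float:
    """Return 1 / (1 + e^(-z)), which lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Evaluate the curve at a few illustrative points (assumed values of z).
for z in (-10, -1, 0, 1, 10):
    y = logistic(z)
    print(f"z = {z:>3}  ->  y = {y:.4f}")
```

Large negative values of z push y toward 0 and large positive values push it toward 1, while z = 0 gives exactly one half, which is why the curve looks like a squeezed S.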


NOW, BEFORE WE CAN
GO ANY FURTHER WITH
LOGISTIC REGRESSION
ANALYSIS, YOU NEED TO
UNDERSTAND MAXIMUM
LIKELIHOOD.



MAXIMUM LIKELIHOOD IS
USED TO ESTIMATE THE
VALUES OF PARAMETERS
OF A POPULATION USING A
REPRESENTATIVE SAMPLE.
THE ESTIMATES ARE MADE
BASED ON PROBABILITY.


MAXIMUM
LIKELIHOOD


MORE
PROBABILITY!


TO EXPLAIN, I'LL
USE A HYPOTHETICAL
SITUATION
STARRING US!



I DON'T KNOW
IF I'M CUT OUT
TO BE A STAR.
THE MAXIMUM LIKELIHOOD METHOD


..WE WENT TO 


.. 

m UNIFORME. 



WHAT?/ 


THEN WE
RANDOMLY PICKED
10 STUDENTS AND
ASKED IF THEY
LIKE THE UNIFORM.


WHAT DO YOU
THINK OF THIS
UNIFORM?


THAT'D BE SO
EMBARRASSING.



HERE ARE THE
IMAGINARY RESULTS.


WOW! MOST
PEOPLE SEEM
TO LIKE OUR
UNIFORM.


STUDENT


DO YOU LIKE THE
NORNS UNIFORM?




IF THE POPULARITY
OF OUR UNIFORMS
THROUGHOUT THE
ENTIRE STUDENT BODY
IS THE PARAMETER p...
...THEN THE PROBABILITY
BASED ON THE IMAGINARY
SURVEY RESULTS IS THIS:


yes no yes no yes yes yes yes no yes

$p \times (1-p) \times p \times (1-p) \times p \times p \times p \times p \times (1-p) \times p = p^7(1-p)^3$

...EQUATION?


YES, WE SOLVE IT BY
FINDING THE MOST
LIKELY VALUE OF p.



$p^7(1-p)^3$

OR

$\log\{p^7(1-p)^3\}$*

WE USE ONE OF
THESE LIKELIHOOD
FUNCTIONS.

EITHER WAY,
THE ANSWER IS
THE SAME.


* TAKING THE LOG OF THIS FUNCTION CAN MAKE LATER CALCULATIONS EASIER.
THAT'S RIGHT. WE TAKE THE LOG
OF THIS FUNCTION BECAUSE IT
MAKES IT EASIER TO CALCULATE THE
DERIVATIVE, WHICH WE NEED TO FIND
THE MAXIMUM LIKELIHOOD.

[Graphs: the likelihood function $p^7(1-p)^3$ and the log-likelihood function $\log\{p^7(1-p)^3\}$, each plotted against p]

IN THE GRAPHS, THIS
PEAK IS THE VALUE OF
p THAT MAXIMIZES THE
VALUE OF THE FUNCTION.
IT'S CALLED THE
MAXIMUM LIKELIHOOD
ESTIMATE.

...SINCE THE FUNCTIONS
PEAK AT THE SAME
PLACE, EVEN THOUGH
THEY HAVE A DIFFERENT
SHAPE, THEY GIVE US
THE SAME ANSWER.



NOW, LET'S REVIEW THE
MAXIMUM LIKELIHOOD
ESTIMATE FOR THE
POPULARITY OF OUR
UNIFORMS.
FINDING THE MAXIMUM LIKELIHOOD
USING THE LIKELIHOOD FUNCTION

Step 1: Find the likelihood function. Here, $p$ stands for the probability
of a Yes, and $1 - p$ stands for the probability of a No. There were 7 Yeses
and 3 Nos.

$$p \times (1-p) \times p \times (1-p) \times p \times p \times p \times p \times (1-p) \times p = p^7(1-p)^3$$

Step 2: Obtain the log-likelihood function and rearrange it.

$$\begin{aligned}
L &= \log\{p^7(1-p)^3\} \\
  &= \log p^7 + \log (1-p)^3 && \text{Take the log of each component.} \\
  &= 7\log p + 3\log(1-p) && \text{Use the Exponentiation Rule from page 22.}
\end{aligned}$$

WE'LL USE L TO MEAN THE LOG-LIKELIHOOD
FUNCTION FROM NOW ON.

Step 3: Differentiate $L$ with respect to $p$ and set the expression equal
to 0. Remember that a function’s rate of change is 0 at its maximum, so
solving this equation finds the peak.

$$\frac{dL}{dp} = 7 \times \frac{1}{p} + 3 \times \frac{1}{1-p} \times (-1) = 7 \times \frac{1}{p} - 3 \times \frac{1}{1-p} = 0$$

Step 4: Rearrange the equation in Step 3 to solve for $p$.

$$\begin{aligned}
7 \times \frac{1}{p} - 3 \times \frac{1}{1-p} &= 0 \\
\left(7 \times \frac{1}{p} - 3 \times \frac{1}{1-p}\right) \times p(1-p) &= 0 \times p(1-p) && \text{Multiply both sides of the equation by } p(1-p). \\
7(1-p) - 3p &= 0 \\
7 - 7p - 3p &= 0 \\
7 - 10p &= 0 \\
p &= \frac{7}{10}
\end{aligned}$$

AND HERE'S THE
MAXIMUM LIKELIHOOD
ESTIMATE!


YEP, 70%.
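
The result of the derivation above can be checked numerically. This is a minimal sketch, not the book's method: instead of differentiating, it evaluates the log-likelihood $L = 7\log p + 3\log(1-p)$ on a grid of candidate values and picks the largest. The grid spacing of 0.001 and the name `log_likelihood` are assumptions made for illustration.

```python
import math


def log_likelihood(p: float, yes: int = 7, no: int = 3) -> float:
    """Log-likelihood of observing `yes` Yeses and `no` Nos when P(Yes) = p."""
    return yes * math.log(p) + no * math.log(1.0 - p)


# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined).
candidates = [i / 1000 for i in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the largest log-likelihood.
p_hat = max(candidates, key=log_likelihood)

print(p_hat)          # grid search result
print(7 / (7 + 3))    # closed form from Step 4: yeses / total
```

Both lines print the same value, matching the 70% found by hand. The grid search is only a check; the calculus route in the steps above gives the exact answer directly.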


CHOOSING PREDICTOR VARIABLES

NORNS DOESN'T SELL A
SPECIAL CAKE EVERY DAY.
PEOPLE ONLY BUY IT
FOR A REALLY SPECIAL
OCCASION, LIKE A
BIRTHDAY.


UNSOLD   SOLD



SO TODAY WE'RE GOING TO
FIND A LOGISTIC REGRESSION
EQUATION TO PREDICT
WHETHER THE NORNS SPECIAL
WILL SELL ON A GIVEN DAY.



OH BOY! THIS IS
REALLY EXCITING!



THAT'S A GREAT
QUESTION.


I'VE BEEN TRYING
TO FIGURE THAT
OUT FOR A WHILE.
I'VE NOTICED THAT
MORE PEOPLE SEEM
TO BUY THE NORNS
SPECIAL WHEN THE
TEMPERATURE IS HIGH,
AND ON WEDNESDAYS,
SATURDAYS, AND
SUNDAYS.
WE USED 1 TO MEAN
SOLD AND 0 TO
MEAN UNSOLD...


1 = SOLD
0 = UNSOLD


...WHICH IS HOW
WE REPRESENT
CATEGORICAL
DATA AS
NUMBERS, RIGHT?



WELL, IN LOGISTIC REGRESSION ANALYSIS,
THESE NUMBERS AREN'T JUST LABELS;
THEY ACTUALLY MEASURE THE PROBABILITY
THAT THE CAKE WAS SOLD. THAT'S
BECAUSE 1 MEANS 100% AND 0 MEANS 0%.


OH! SINCE WE
KNOW IT WAS SOLD,
THERE'S A 100%
PROBABILITY THAT
IT WAS SOLD.



WE ALSO
KNOW FOR
SURE IF IT WAS
WEDNESDAY,
SATURDAY, OR
SUNDAY.



IN THIS CASE, HIGH TEMPERATURE
IS MEASURABLE DATA, SO WE
USE THE TEMPERATURE, JUST
LIKE IN LINEAR REGRESSION
ANALYSIS. CATEGORICAL DATA
ALSO WORKS IN BASICALLY
THE SAME WAY AS IN LINEAR
REGRESSION ANALYSIS, AND
ONCE AGAIN WE CAN USE ANY
COMBINATION OF CATEGORICAL
AND NUMERICAL DATA.



STEP 1: DRAW A SCATTER PLOT OF THE PREDICTOR VARIABLES AND THE
OUTCOME VARIABLE TO SEE WHETHER THEY APPEAR TO BE RELATED.
STEP 2: CALCULATE THE LOGISTIC REGRESSION EQUATION.


SO THEN I'LL START
CALCULATING THE
LOGISTIC REGRESSION
EQUATION!



OH, I
TRIED
THAT
ONCE.



I WAS UP FOR DAYS,
SUBSISTING ON COFFEE
AND PUDDING. MY
ROOMMATE FOUND
ME PERCHED ON THE
COUNTER, WEARING
FAIRY WINGS...




NO NEED.
WE CAN
JUST USE MY
LAPTOP!
Determine the binomial logistic equation for each sample.

Wednesday, Saturday,   High temperature   Sales of the Norns   Sales of the Norns special
or Sunday ($x_1$)      ($x_2$)            special ($y$)        $\hat{y} = \frac{1}{1 + e^{-(a_1 x_1 + a_2 x_2 + b)}}$

0                      28                 1                    $\frac{1}{1 + e^{-(a_1 \times 0 + a_2 \times 28 + b)}}$
0                      24                 0                    $\frac{1}{1 + e^{-(a_1 \times 0 + a_2 \times 24 + b)}}$
⋮                      ⋮                  ⋮                    ⋮
1                      24                 1                    $\frac{1}{1 + e^{-(a_1 \times 1 + a_2 \times 24 + b)}}$


Obtain the likelihood function. The equation from Step 1 represents
a sold cake, and (1 - the equation) represents an unsold cake.

$$\underbrace{\frac{1}{1 + e^{-(a_1 \times 0 + a_2 \times 28 + b)}}}_{\text{Sold}} \times \underbrace{\left(1 - \frac{1}{1 + e^{-(a_1 \times 0 + a_2 \times 24 + b)}}\right)}_{\text{Unsold}} \times \cdots \times \underbrace{\frac{1}{1 + e^{-(a_1 \times 1 + a_2 \times 24 + b)}}}_{\text{Sold}}$$
Find the maximum likelihood coefficients. These coefficients
maximize the log-likelihood function $L$.


The values are:*

$a_1 = 2.44$
$a_2 = 0.54$
$b = -15.20$



We can plug these values into the log-likelihood function to calculate $L$,
which we’ll use to calculate $R^2$.

$$L = \log_e\left(\frac{1}{1 + e^{-(2.44 \times 0 + 0.54 \times 28 - 15.20)}}\right) + \log_e\left(1 - \frac{1}{1 + e^{-(2.44 \times 0 + 0.54 \times 24 - 15.20)}}\right) + \cdots + \log_e\left(\frac{1}{1 + e^{-(2.44 \times 1 + 0.54 \times 24 - 15.20)}}\right)$$

* See page 210 for a more detailed explanation of these calculations.


Calculate the logistic regression equation.


We fill in the coefficients calculated in Step 4 to get the following
logistic regression equation:

$$y = \frac{1}{1 + e^{-(2.44 x_1 + 0.54 x_2 - 15.20)}}$$
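
To see how the fitted equation turns a day's conditions into a probability, here is a minimal sketch that plugs in the rounded coefficients reported above (2.44, 0.54, and -15.20). The function name `sale_probability` and its parameter names are illustrative assumptions. Because the coefficients are rounded to two decimals, the results can differ slightly from figures computed with unrounded coefficients.

```python
import math

# Rounded coefficients as reported in the text.
A1 = 2.44     # effect of the day being Wednesday, Saturday, or Sunday
A2 = 0.54     # effect of each additional degree of high temperature
B = -15.20    # intercept


def sale_probability(is_wed_sat_sun: int, high_temp_c: float) -> float:
    """Estimated probability that the Norns Special sells.

    is_wed_sat_sun: 1 if the day is Wednesday, Saturday, or Sunday, else 0.
    high_temp_c: the day's high temperature in degrees Celsius.
    """
    z = A1 * is_wed_sat_sun + A2 * high_temp_c + B
    return 1.0 / (1.0 + math.exp(-z))


# Compare a weekday and a weekend day at the same temperature.
print(round(sale_probability(0, 28), 2))
print(round(sale_probability(1, 28), 2))
```

At the same temperature, the probability is noticeably higher when the day indicator is 1, which is the pattern Risa described.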
STEP 3: ASSESS THE ACCURACY OF THE EQUATION.



* IN THIS EXAMPLE, WE USE MCFADDEN'S
PSEUDO-$R^2$ FORMULA.



HERE'S THE EQUATION THAT
WE USE TO CALCULATE $R^2$
IN LOGISTIC REGRESSION
ANALYSIS.

$$R^2 = 1 - \frac{\text{MAXIMUM VALUE OF LOG-LIKELIHOOD FUNCTION } L}{n_1 \log n_1 + n_0 \log n_0 - (n_1 + n_0)\log(n_1 + n_0)}$$

THE n VARIABLES ARE
A TALLY OF THE CAKES
THAT ARE SOLD ($n_1$) OR
UNSOLD ($n_0$).


AND HERE'S A
MORE GENERAL
DEFINITION.

$n_1$: THE NUMBER OF DATA POINTS WHOSE
OUTCOME VARIABLE'S VALUE IS 1

$n_0$: THE NUMBER OF DATA POINTS WHOSE
OUTCOME VARIABLE'S VALUE IS 0


I'M STILL NOT
SURE HOW TO USE
THIS EQUATION
WITH THE NORNS
SPECIAL DATA.

DON'T
WORRY, IT'S
NOT THAT
HARD.



WE JUST FILL IN
THE NUMBERS
FOR THE NORNS
SPECIAL...

$$R^2 = 1 - \frac{\text{MAXIMUM VALUE OF LOG-LIKELIHOOD FUNCTION } L}{n_1 \log n_1 + n_0 \log n_0 - (n_1 + n_0)\log(n_1 + n_0)} = 1 - \frac{-8.9010}{8 \log 8 + 13 \log 13 - (8 + 13)\log(8 + 13)} = .3622$$

WHOA,
I WASN'T
EXPECTING
THAT.
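
The pseudo-$R^2$ calculation can be reproduced in a few lines. This is a minimal sketch assuming the values used elsewhere in this chapter: a maximum log-likelihood of -8.9010, 8 days on which the cake sold, and 13 days on which it did not. The function name `mcfadden_r2` is an illustrative choice.

```python
import math


def mcfadden_r2(max_log_likelihood: float, n1: int, n0: int) -> float:
    """McFadden's pseudo-R^2 for a binary outcome.

    n1: number of data points whose outcome is 1 (sold).
    n0: number of data points whose outcome is 0 (unsold).
    """
    n = n1 + n0
    # Log-likelihood of a model with no predictors, which always
    # predicts the overall sales rate n1 / n.
    null_log_likelihood = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)
    return 1.0 - max_log_likelihood / null_log_likelihood


r2 = mcfadden_r2(-8.9010, n1=8, n0=13)
print(round(r2, 4))
```

The denominator is the log-likelihood of an equation that ignores the predictors entirely, so the statistic measures how much the day of the week and the temperature improve on that baseline.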
HMMM... .36?
THAT'S LOW,
ISN'T IT?



JUST LIKE IN LINEAR
REGRESSION ANALYSIS,
A HIGHER $R^2$ MEANS
THE EQUATION IS MORE
ACCURATE.


BUT THERE'S NO SET
RULE FOR HOW HIGH
$R^2$ NEEDS TO BE,
RIGHT?



AND TO BE FAIR, THE $R^2$
IN LOGISTIC REGRESSION
ANALYSIS DOES TEND TO
BE LOWER. BUT AN $R^2$
AROUND .4 IS USUALLY A
PRETTY GOOD RESULT.



WE'RE NOT SURE YET.
WE'LL HAVE TO USE A
DIFFERENT METHOD
TO FIND OUT.
       Wednesday, Saturday,   High temp. (°C)   Actual sales   Predicted sales
Day    or Sunday ($x_1$)      ($x_2$)           ($y$)          ($\hat{y}$)

5      0                      28                1              .51 sold
6      0                      24                0              .11 unsold
7      1                      26                0              .80 sold
8      0                      24                0              .11 unsold
9      0                      23                0              .06 unsold
10     1                      28                1              .92 sold
11     1                      24                0              .58 sold
12     0                      26                1              .26 unsold
13     0                      25                0              .17 unsold
14     1                      28                1              .92 sold
15     0                      21                0              .02 unsold
16     0                      22                0              .04 unsold
17     1                      27                1              .87 sold
18     1                      26                1              .80 sold
19     0                      26                0              .26 unsold
20     0                      21                0              .02 unsold
21     1                      21                1              .21 unsold
22     0                      27                0              .38 unsold
23     0                      23                0              .06 unsold
24     1                      22                0              .31 unsold
25     1                      24                1              .58 sold

For example, for $x_1 = 1$ and $x_2 = 24$:

$$\hat{y} = \frac{1}{1 + e^{-(2.44 \times 1 + 0.54 \times 24 - 15.20)}} = .58$$



FOR ONE THING, THE
NORNS SPECIAL DID NOT
SELL ON THE 7TH AND
THE 11TH, EVEN THOUGH
WE PREDICTED THAT
IT WOULD.


(DAY 7: .80 sold; DAY 11: .58 sold)

GREAT!
ANYTHING
ELSE?


ON THE 12TH AND THE
21ST, WE PREDICTED
THAT IT WOULDN'T SELL,
BUT IT DID! WE CAN SEE
WHERE THE EQUATION
WAS WRONG.



BRILLIANT!
IF WE DIVIDE THE NUMBER OF
TIMES THE PREDICTION WAS
WRONG BY THE NUMBER OF
SAMPLES, LIKE THIS, WE HAVE...

$$\frac{\text{THE NUMBER OF SAMPLES THAT DIDN'T MATCH THE PREDICTION}}{\text{TOTAL NUMBER OF SAMPLES}}$$

...THE APPARENT
ERROR RATE.


WE'LL GET THE
ERROR AS A
PERCENTAGE!



EXACTLY.



SO THE APPARENT
ERROR RATE IN THIS
CASE IS...


...19%!



AND 19% IS PRETTY LOW,
WHICH IS GOOD NEWS.
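
The apparent error rate can be computed directly from the table of actual and predicted sales for days 5 through 25. This is a minimal sketch: the two lists below copy the actual outcomes and predicted probabilities from that table, and the 0.5 cutoff for calling a prediction "sold" is an assumption that matches the sold/unsold labels shown there.

```python
# Actual outcomes (1 = sold, 0 = unsold) for days 5 through 25.
actual = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]

# Predicted probabilities of a sale for the same days.
predicted = [.51, .11, .80, .11, .06, .92, .58, .26, .17, .92, .02,
             .04, .87, .80, .26, .02, .21, .38, .06, .31, .58]

CUTOFF = 0.5  # assumed threshold: predict "sold" at or above this probability

# Count the days where the predicted label disagrees with what happened.
mismatches = sum(
    1 for y, p in zip(actual, predicted) if (p >= CUTOFF) != (y == 1)
)
apparent_error_rate = mismatches / len(actual)

print(mismatches, len(actual))
print(f"{apparent_error_rate:.0%}")
```

The four mismatches are the days called out in the dialog: the 7th and 11th (predicted to sell but did not) and the 12th and 21st (predicted not to sell but did).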



OH, AND ONE MORE THING...
YOU CAN ALSO GET A SENSE
OF HOW ACCURATE THE
EQUATION IS BY DRAWING
A SCATTER PLOT OF
$y$ AND $\hat{y}$.


CORRELATION COEFFICIENT = .6279



THE CORRELATION
COEFFICIENT IS ALSO USEFUL.
IT TELLS US HOW WELL THE
PREDICTED VALUES MATCH
ACTUAL SALES.


THANKS FOR
DRAWING IT
THIS TIME.
STEP 4: CONDUCT THE HYPOTHESIS TESTS.


AS WE DID BEFORE, WE
NEED TO DO HYPOTHESIS
TESTING TO SEE IF OUR
REGRESSION COEFFICIENTS
ARE SIGNIFICANT.

LIKE THIS.


COMPREHENSIVE HYPOTHESIS TEST

NULL HYPOTHESIS: $A_1 = A_2 = 0$

ALTERNATIVE HYPOTHESIS: ONE OF THE FOLLOWING IS TRUE:

• $A_1 \neq 0$ AND $A_2 \neq 0$
• $A_1 \neq 0$ AND $A_2 = 0$
• $A_1 = 0$ AND $A_2 \neq 0$


HYPOTHESIS TEST FOR AN INDIVIDUAL
REGRESSION COEFFICIENT

NULL HYPOTHESIS: $A_i = 0$

ALTERNATIVE HYPOTHESIS: $A_i \neq 0$
WE'LL DO THE LIKELIHOOD RATIO TEST. THIS
TEST LETS US EXAMINE ALL THE COEFFICIENTS
AT ONCE AND ASSESS THE RELATIONSHIPS
AMONG THE COEFFICIENTS.



THE STEPS OF THE LIKELIHOOD RATIO TEST


Step 1: Define the populations.

All days the Norns Special is sold, comparing
Wednesdays, Saturdays, and Sundays against the
remaining days, at each high temperature.

Step 2: Set up a null hypothesis and an alternative hypothesis.

Null hypothesis is $A_1 = 0$ and $A_2 = 0$.

Alternative hypothesis is $A_1 \neq 0$ or $A_2 \neq 0$.

Step 3: Select which hypothesis test to conduct.

We’ll perform the likelihood ratio test.

Step 4: Choose the significance level.

We’ll use a significance level of .05.

Step 5: Calculate the test statistic from the sample data.

The test statistic is:

$$2[L - n_1\log_e(n_1) - n_0\log_e(n_0) + (n_1 + n_0)\log_e(n_1 + n_0)]$$

When we plug in our data, we get:

$$2[-8.9010 - 8\log_e 8 - 13\log_e 13 + (8 + 13)\log_e(8 + 13)] = 10.1$$

The test statistic follows a chi-squared distribution
with 2 degrees of freedom (the number of predictor
variables), if the null hypothesis is true.

Step 6: Determine whether the p-value for the test statistic
obtained in Step 5 is smaller than the significance level.

The significance level is .05. The value of the
test statistic is 10.1, so the p-value is .006. Finally,
.006 < .05.*

Step 7: Decide whether you can reject the null hypothesis.

Since the p-value is smaller than the significance
level, we reject the null hypothesis.


* How to obtain the p-value in a chi-squared distribution
is explained on page 205.
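
Steps 5 and 6 of the likelihood ratio test can be checked with a short calculation. This is a minimal sketch using only the standard library. It relies on a known property of the chi-squared distribution with exactly 2 degrees of freedom: the probability of exceeding a value $x$ is $e^{-x/2}$. That shortcut applies only to the 2-degree case here; with a different number of predictors you would need a statistics library. The function name is an illustrative assumption.

```python
import math


def likelihood_ratio_statistic(max_log_likelihood: float, n1: int, n0: int) -> float:
    """Test statistic 2[L - n1*log(n1) - n0*log(n0) + (n1+n0)*log(n1+n0)]."""
    n = n1 + n0
    return 2.0 * (
        max_log_likelihood
        - n1 * math.log(n1)
        - n0 * math.log(n0)
        + n * math.log(n)
    )


statistic = likelihood_ratio_statistic(-8.9010, n1=8, n0=13)

# Upper-tail probability of a chi-squared distribution with 2 degrees of freedom.
p_value = math.exp(-statistic / 2.0)

print(round(statistic, 1))
print(round(p_value, 3))
print(p_value < 0.05)   # True means we reject the null hypothesis
```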
NEXT, WE'LL USE THE WALD TEST TO SEE
WHETHER EACH OF OUR PREDICTOR VARIABLES
HAS A SIGNIFICANT EFFECT ON THE SALE OF
THE NORNS SPECIAL. WE'LL DEMONSTRATE
USING DAYS OF THE WEEK.


THE STEPS OF THE WALD TEST


Step 1: Define the population.

All days the Norns Special is sold, comparing
Wednesdays, Saturdays, and Sundays against the
remaining days, at each high temperature.

Step 2: Set up a null hypothesis and an alternative hypothesis.

Null hypothesis is $A_1 = 0$.

Alternative hypothesis is $A_1 \neq 0$.

Step 3: Select which hypothesis test to conduct.

Perform the Wald test.

Step 4: Choose the significance level.

We’ll use a significance level of .05.

Step 5: Calculate the test statistic from the sample data.

The test statistic for the Wald test is

$$\frac{a_1^2}{S_{11}}$$

In this example, the value of the test statistic is:

$$\frac{2.44^2}{1.5475} = 3.9$$

The test statistic will follow a chi-squared
distribution with 1 degree of freedom, if the null
hypothesis is true.

Step 6: Determine whether the p-value for the test statistic
obtained in Step 5 is smaller than the significance level.

The value of the test statistic is 3.9, so the p-value
is .0489. You can see that .0489 < .05, so the p-value
is smaller.

Step 7: Decide whether you can reject the null hypothesis.

Since the p-value is smaller than the significance
level, we reject the null hypothesis.


IN SOME REFERENCES, THIS PROCESS IS
EXPLAINED USING NORMAL DISTRIBUTION
INSTEAD OF CHI-SQUARED DISTRIBUTION. THE
FINAL RESULT WILL BE THE SAME NO MATTER
WHICH METHOD YOU CHOOSE.
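
The Wald test arithmetic can be sketched as follows, using the coefficient 2.44 and the denominator 1.5475 printed in Step 5. The p-value uses a standard identity: for a chi-squared distribution with 1 degree of freedom, the probability of exceeding $x$ equals `erfc(sqrt(x / 2))`, which ties in with the remark that the same test can be phrased with the normal distribution. Because the inputs here are rounded, the printed figures land close to, but not exactly on, the 3.9 and .0489 reported in the steps.

```python
import math

coefficient = 2.44   # a_1, the day-of-the-week coefficient
variance = 1.5475    # S_11 as printed in Step 5

# Wald statistic: squared coefficient divided by its variance.
statistic = coefficient ** 2 / variance

# Upper-tail probability of a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2.0))

print(round(statistic, 2))
print(round(p_value, 4))
print(p_value < 0.05)   # True means we reject the null hypothesis
```

The p-value falls only just under the .05 significance level, so the conclusion is the same as in Step 7, but by a narrow margin.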
This is how we calculate the standard error matrix. The values of this
matrix are used to calculate the Wald test statistic in Step 5 on page 180.

$$\left[
\begin{pmatrix} 0 & 0 & \cdots & 1 \\ 28 & 24 & \cdots & 24 \\ 1 & 1 & \cdots & 1 \end{pmatrix}
\begin{pmatrix} 0.51 \times 0.49 & 0 & \cdots & 0 \\ 0 & 0.11 \times 0.89 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0.58 \times 0.42 \end{pmatrix}
\begin{pmatrix} 0 & 28 & 1 \\ 0 & 24 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 24 & 1 \end{pmatrix}
\right]^{-1}
= \begin{pmatrix} 1.5388 & \cdots & \cdots \\ \cdots & 0.881 & \cdots \\ \cdots & \cdots & \cdots \end{pmatrix}$$

These 1s represent an immeasurable constant. In other words,
they are a placeholder.

$S_{11}$ in Step 5 is the first diagonal value (1.5388), and the second diagonal value is $S_{22}$.*

* THIS CALCULATION WAS MADE USING ROUNDED NUMBERS. IF YOU USE THE FULL, UNROUNDED NUMBERS, THE RESULT WILL BE .44.
WHY DIDN'T
YOU JUST
TELL ME?

I'M HIROTO
FUKAZAWA.
LOGISTIC REGRESSION ANALYSIS IN THE REAL WORLD


On page 68, Risa made a list of all the steps of regression analysis,
but later it was noted that it’s not always necessary to perform
each of the steps. For example, if we’re analyzing Miu’s height over
time, there’s just one Miu, and she was just one height at a given
age. There’s no population of Miu heights at age 6, so analyzing the
“population” wouldn’t make sense.

In the real world too, it’s not uncommon to skip Step 1, drawing
the scatter plots, especially when there are thousands of
data points to consider. For example, in a clinical trial with many
participants, researchers may choose to start at Step 2 to save
time, especially if they have software that can do the calculations
quickly for them.

Furthermore, when you do statistics in the real world, don’t just
dive in and apply tests. Think about your data and the purpose of
the test. Without context, the numbers are just numbers and signify
nothing.


LOGIT, ODDS RATIO, AND RELATIVE RISK

Odds are a measure that suggests how closely a predictor and
an outcome are associated. They are defined as the ratio of the
probability of an outcome happening in a given situation ($y$) to
the probability of the outcome not happening ($1 - y$):

$$\frac{y}{1 - y}$$


LOGIT

The logit is the log of the odds. The logistic function is its inverse,
taking a log-odds and turning it into a probability. The logit is
mathematically related to the regression coefficients: for every unit of
increase in the predictor, the logit of the outcome increases by the
value of the regression coefficient.

The equation for the logistic function, which you saw earlier
when we calculated that logistic regression equation on page 170,
is as follows:

$$y = \frac{1}{1 + e^{-z}}$$

where $z$ is the logit and $y$ is the probability.
To find the logit, we invert the logistic equation like this:

$$\log\frac{y}{1 - y} = z$$

This inverse function gives the logit based on the original logistic
regression equation. The process of finding the logit is like finding
any other mathematical inverse:

$$\begin{aligned}
y &= \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \times \frac{e^z}{e^z} = \frac{e^z}{e^z + 1} \\
y \times (e^z + 1) &= \frac{e^z}{e^z + 1} \times (e^z + 1) && \text{Multiply both sides of the equation by } (e^z + 1). \\
y \times e^z + y &= e^z \\
y &= e^z - y \times e^z && \text{Transpose terms.} \\
y &= (1 - y)e^z \\
y \times \frac{1}{1 - y} &= (1 - y)e^z \times \frac{1}{1 - y} && \text{Multiply both sides of the equation by } \frac{1}{1 - y}. \\
\frac{y}{1 - y} &= e^z \\
\log\frac{y}{1 - y} &= \log e^z = z
\end{aligned}$$

Therefore, the logistic regression equation for selling the Norns
Special (obtained on page 172),

$$y = \frac{1}{1 + e^{-(2.44x_1 + 0.54x_2 - 15.20)}}$$

can be rewritten as

$$\log\frac{y}{1 - y} = 2.44x_1 + 0.54x_2 - 15.20.$$

So the odds of selling the Norns Special on a given day, at
a given temperature, are $e^{2.44x_1 + 0.54x_2 - 15.20}$, and the logit is
$2.44x_1 + 0.54x_2 - 15.20$.
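
The relationship between probability, odds, and logit can be verified numerically. This is a minimal sketch using the rounded Norns Special coefficients; the helper names `logistic`, `odds`, and `logit` and the example day (a Wednesday, Saturday, or Sunday with a high of 24 degrees) are illustrative assumptions.

```python
import math


def logistic(z: float) -> float:
    """Turn a logit (log-odds) z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(y: float) -> float:
    """Odds of an outcome whose probability is y."""
    return y / (1.0 - y)


def logit(y: float) -> float:
    """Log of the odds: the inverse of the logistic function."""
    return math.log(odds(y))


# Example day: x1 = 1 (Wed, Sat, or Sun), x2 = 24 degrees.
z = 2.44 * 1 + 0.54 * 24 - 15.20
y = logistic(z)

print(round(z, 2))         # the logit computed from the coefficients
print(round(logit(y), 2))  # the same logit recovered from the probability
print(math.isclose(odds(y), math.exp(z)))  # the odds equal e^z
```

Applying `logit` to the output of `logistic` returns the original value of z, which is exactly what the derivation above shows algebraically.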


ODDS RATIO

Another way to quantify the association between a predictor and an
outcome is the odds ratio (OR). The odds ratio compares two sets
of odds for different conditions of the same variable.
Let’s calculate the odds ratio for selling the Norns Special on
Wednesday, Saturday, or Sunday versus other days of the week:

$$\frac{\left(\dfrac{\text{sales rate of Wed, Sat, or Sun}}{1 - \text{sales rate of Wed, Sat, or Sun}}\right)}{\left(\dfrac{\text{sales rate of days other than Wed, Sat, or Sun}}{1 - \text{sales rate of days other than Wed, Sat, or Sun}}\right)}
= \frac{\left(\dfrac{6/9}{1 - 6/9}\right)}{\left(\dfrac{2/12}{1 - 2/12}\right)}
= \frac{\left(\dfrac{6/9}{3/9}\right)}{\left(\dfrac{2/12}{10/12}\right)}
= \frac{6/3}{2/10} = \frac{6}{3} \times \frac{10}{2} = 10$$

This shows that the odds of selling the Norns Special on one
of those three days are 10 times higher than on the other days of
the week.
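
The odds ratio can be computed straight from the four counts behind those sales rates: 6 sold and 3 unsold on Wednesdays, Saturdays, and Sundays, versus 2 sold and 10 unsold on the other days. This is a minimal sketch; the function and parameter names are illustrative assumptions.

```python
def odds_ratio(sold_a: int, unsold_a: int, sold_b: int, unsold_b: int) -> float:
    """Odds of a sale in group A divided by the odds of a sale in group B.

    Within each group, the odds are (number sold) / (number unsold), which
    equals (sales rate) / (1 - sales rate).
    """
    odds_a = sold_a / unsold_a
    odds_b = sold_b / unsold_b
    return odds_a / odds_b


# Group A: Wed, Sat, or Sun. Group B: all other days.
result = odds_ratio(sold_a=6, unsold_a=3, sold_b=2, unsold_b=10)
print(round(result, 2))
```

Working with counts rather than rates avoids the nested fractions in the hand calculation, since the group totals (9 and 12) cancel out.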


ADJUSTED ODDS RATIO

So far, we’ve used only the odds based on the day of the week. If we
want to find the truest representation of the odds ratio, we would
need to calculate the odds ratio of each variable in turn and then
combine the ratios. This is called the adjusted odds ratio. For the
data collected by Risa on page 176, this means finding the odds
ratio for two variables (day of the week and temperature) at the
same time.

Table 4-1 shows the logistic regression equations and odds
when considering each variable separately and when considering
them together, which we’ll need to calculate the adjusted
odds ratios.


TABLE 4-1: THE LOGISTIC REGRESSION EQUATIONS AND ODDS FOR THE DATA ON PAGE 176

Predictor variable          Logistic regression equation                                Odds

“Wed, Sat, or Sun” only     $y = \frac{1}{1 + e^{-(2.30x_1 - 1.61)}}$                   $e^{(2.30x_1 - 1.61)}$

“High temperature” only     $y = \frac{1}{1 + e^{-(0.52x_2 - 13.44)}}$                  $e^{(0.52x_2 - 13.44)}$

“Wed, Sat, or Sun” and      $y = \frac{1}{1 + e^{-(2.44x_1 + 0.54x_2 - 15.20)}}$        $e^{(2.44x_1 + 0.54x_2 - 15.20)}$
“High temperature”
The odds ratio for a sale based only on the day of the week is
calculated as follows:

$$\frac{\text{odds of a sale on Wed, Sat, or Sun}}{\text{odds of a sale on days other than Wed, Sat, or Sun}} = \frac{e^{2.30 \times 1 - 1.61}}{e^{2.30 \times 0 - 1.61}} = e^{2.30 \times 1 - 1.61 - (2.30 \times 0 - 1.61)} = e^{2.30}$$

This is the unadjusted odds ratio for “Wednesday, Saturday, or
Sunday.” If we evaluate that, we get $e^{2.30} \approx 10$, the same value we got
for the odds ratio on page 192, as you would expect!

To find the odds ratio for a sale based only on temperature, we look at
the effect a change in temperature has. We therefore compare the odds
of making a sale at two temperatures that differ by 1 degree,
calculated as follows:

$$\frac{\text{odds of a sale with high temp of } (k + 1) \text{ degrees}}{\text{odds of a sale with high temp of } k \text{ degrees}} = \frac{e^{0.52 \times (k + 1) - 13.44}}{e^{0.52 \times k - 13.44}} = e^{0.52 \times (k + 1) - 13.44 - (0.52 \times k - 13.44)} = e^{0.52}$$

This is the unadjusted odds ratio for a one degree increase in
temperature.

However, the logistic regression equation that was calculated
from this data considered both of these variables together, so
the regression coefficients (and thus the odds ratios) have to be
adjusted to account for multiple variables.

In this case, when the regression equation is calculated using
both day of the week and temperature, we see that both coefficients
and the constant have changed. For day of the week, the coefficient
has increased from 2.30 to 2.44, the temperature coefficient has
increased from 0.52 to 0.54, and the constant is now -15.20. These
changes arise because the two predictors are related to each other
in the sample: when both are in the equation, each coefficient
describes the effect of its own variable with the other variable
held fixed. With these new numbers, the same calculations are
performed, first varying the day of the week:

$$\frac{e^{2.44 \times 1 + 0.54 \times k - 15.20}}{e^{2.44 \times 0 + 0.54 \times k - 15.20}} = e^{2.44 \times 1 + 0.54 \times k - 15.20 - (2.44 \times 0 + 0.54 \times k - 15.20)} = e^{2.44}$$

This is the adjusted odds ratio for “Wednesday, Saturday,
or Sunday.” In other words, the day-of-the-week odds have been
adjusted to account for any combined effects that may be seen
when temperature is also considered.

Likewise, after adjusting the coefficients, the odds ratio for
temperature can be recalculated:

$$\frac{e^{2.44 \times 1 + 0.54 \times (k + 1) - 15.20}}{e^{2.44 \times 1 + 0.54 \times k - 15.20}} = \frac{e^{2.44 \times 0 + 0.54 \times (k + 1) - 15.20}}{e^{2.44 \times 0 + 0.54 \times k - 15.20}} = e^{0.54 \times (k + 1) - 15.20 - (0.54 \times k - 15.20)} = e^{0.54}$$

This is the adjusted odds ratio for “high temperature.” In this
case, the temperature odds ratio has been adjusted to account for
possible effects of the day of the week.
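
Since each odds ratio is just $e$ raised to the matching regression coefficient, the unadjusted and adjusted values can be compared side by side. This is a minimal sketch using the coefficients from Table 4-1; the dictionary layout and the `check` helper are illustrative choices. The helper confirms that the ratio of odds does not depend on the temperature $k$ chosen, which is why $k$ cancels in the equations above.

```python
import math

# Coefficients from Table 4-1: (day of week, high temperature).
unadjusted = {"day": 2.30, "temp": 0.52}   # each variable fitted on its own
adjusted = {"day": 2.44, "temp": 0.54}     # both variables fitted together

for name in ("day", "temp"):
    print(
        name,
        round(math.exp(unadjusted[name]), 2),   # unadjusted odds ratio
        round(math.exp(adjusted[name]), 2),     # adjusted odds ratio
    )


def check(k: float) -> float:
    """Adjusted day-of-week odds ratio computed at temperature k."""
    odds_special_day = math.exp(2.44 * 1 + 0.54 * k - 15.20)
    odds_other_day = math.exp(2.44 * 0 + 0.54 * k - 15.20)
    return odds_special_day / odds_other_day


# The ratio is the same at any temperature.
print(math.isclose(check(21), check(28)))
```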


HYPOTHESIS TESTING WITH ODDS

As you’ll remember, in linear regression analysis, we perform a
hypothesis test by asking whether $A_i$ is equal to zero, like this:

Null hypothesis: $A_i = 0$

Alternative hypothesis: $A_i \neq 0$

In logistic regression analysis, we perform a hypothesis test by
evaluating whether coefficient $A_i$ as a power of $e$ equals $e^0$:

Null hypothesis: $e^{A_i} = e^0 = 1$

Alternative hypothesis: $e^{A_i} \neq e^0 = 1$

Remember from Table 4-1 that $e^{(2.30x_1 - 1.61)}$ is the odds of selling
the Norns Special based on the day of the week. If, instead, the
odds were found to be $e^{0 \times x_1 - 1.61}$, it would mean the odds of selling the
special were the same every day of the week. Therefore, the null
hypothesis would be true: day of the week has no effect on sales.
Checking whether $A_i = 0$ and whether $e^{A_i} = e^0 = 1$ are effectively the
same thing, but because logistic regression analysis is about odds
and probabilities, it is more relevant to write the hypothesis test in
terms of odds.


CONFIDENCE INTERVAL FOR AN ODDS RATIO

Odds ratios are often used in clinical studies, and they’re generally
presented with a confidence interval. For example, if medical
researchers were trying to determine whether ginger helps to alleviate
an upset stomach, they might separate people with stomach
ailments into two groups and then give one group ginger pills and
the other a placebo. The scientists would then measure the discomfort
of the people after taking the pills and calculate an odds
ratio. If the odds ratio showed that people given ginger felt better
than people given a placebo, the researchers could use a confidence
interval to get a sense of the standard error and the accuracy of the
result.

We can also calculate a confidence interval for the Norns Special
data. Below, we calculate the interval with a 95% confidence rate.

$$e^{2.44 - 1.96\sqrt{1.5388}} = 1.0 \quad \text{to} \quad e^{2.44 + 1.96\sqrt{1.5388}} = 130.5$$

If we look at a population of all days that a Norns Special was
on sale, we can be 95% confident that the odds ratio is somewhere
between 1 and 130.5. In other words, at worst, there is no difference in sales based
on day of the week (when the odds ratio = 1), and at best, there is
a very large difference based on the day of the week. If we chose a
confidence rate of 99%, we would change the 1.96 above to 2.58,
which makes the interval 0.5 to 281.6. As you can see, a higher
confidence rate leads to a larger interval.
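
Both intervals can be reproduced with a short function. This is a minimal sketch using the figures from the text: the coefficient 2.44, the variance 1.5388, and the multipliers 1.96 (95%) and 2.58 (99%). The function name `odds_ratio_interval` is an illustrative assumption.

```python
import math


def odds_ratio_interval(coefficient: float, variance: float, multiplier: float):
    """Confidence interval for an odds ratio, e^(coefficient +/- multiplier * SE).

    The standard error (SE) is the square root of the coefficient's variance.
    """
    half_width = multiplier * math.sqrt(variance)
    return math.exp(coefficient - half_width), math.exp(coefficient + half_width)


low_95, high_95 = odds_ratio_interval(2.44, 1.5388, 1.96)
low_99, high_99 = odds_ratio_interval(2.44, 1.5388, 2.58)

print(round(low_95, 1), round(high_95, 1))
print(round(low_99, 1), round(high_99, 1))
```

The interval is built on the coefficient scale and then exponentiated, which is why it is not symmetric around the odds ratio itself: the upper limit sits much farther from $e^{2.44}$ than the lower limit does.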

RELATIVE RISK 

The relative risk (RR), another type of ratio, compares the probability
of an event occurring in a group exposed to a particular factor to
the probability of the same event occurring in a nonexposed group.
This ratio is often used in statistics when a researcher wants to
compare two outcomes and the outcome of interest is relatively
rare. For example, it’s often used to study factors associated with
contracting a disease or the side effects of a medication.

You can also use relative risk to study something less serious
(and less rare), namely whether day of the week increases the
chances that the Norns Special will sell. We’ll use the data from
page 166.

First, we make a table like Table 4-2 with the condition on one
side and the outcome on the other. In this case, the condition is
the day of the week. The condition must be binary (yes or no), so
since Risa thinks the Norns Special sells best on Wednesday, Saturday,
and Sunday, we consider the condition present on one of those
three days and absent on any other day. As for the outcome, either
the cake sold or it didn’t.


TABLE 4-2: CROSS-TABULATION TABLE OF "WEDNESDAY,
SATURDAY, OR SUNDAY" AND "SALES OF NORNS SPECIAL"

                          Sales of Norns Special
                          Yes     No      Sum

Wed, Sat,        Yes      6       3       9
or Sun           No       2       10      12

Sum                       8       13      21


To find the relative risk, we need to find the ratio of Norns
Specials sold on Wednesday, Saturday, or Sunday to the total number
offered for sale on those days. In our sample data, the number
sold was 6, and the number offered for sale was 9 (3 were not sold).
Thus, the ratio is 6:9.

Next, we need the ratio of the number sold on any other day
to the total number offered for sale on any other day. This ratio
is 2:12.

Finally, we divide these ratios to find the relative risk:

$$\frac{\text{sales rate of Wed, Sat, or Sun}}{\text{sales rate of days other than Wed, Sat, or Sun}} = \frac{(6/9)}{(2/12)} = \frac{6}{9} \div \frac{2}{12} = \frac{6}{9} \times \frac{12}{2} = \frac{2}{3} \times 6 = 4$$

So the Norns Special is 4 times more likely to sell on Wednesday,
Saturday, or Sunday. It looks like Risa was right!
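
The relative risk calculation can be sketched directly from the counts in Table 4-2. The function and parameter names below are illustrative assumptions. For contrast, the same counts are also used to compute the odds ratio, which shows how the two measures differ when the outcome is not rare.

```python
def relative_risk(sold_a: int, total_a: int, sold_b: int, total_b: int) -> float:
    """Sales rate in group A divided by the sales rate in group B."""
    return (sold_a / total_a) / (sold_b / total_b)


# Group A: Wed, Sat, or Sun (6 sold out of 9 days).
# Group B: all other days (2 sold out of 12 days).
rr = relative_risk(sold_a=6, total_a=9, sold_b=2, total_b=12)

# Odds ratio from the same table: (sold / unsold) in A over (sold / unsold) in B.
odds_ratio = (6 / 3) / (2 / 10)

print(round(rr, 2))
print(round(odds_ratio, 2))
```

The relative risk compares probabilities (6/9 against 2/12), while the odds ratio compares odds (6/3 against 2/10), so the two numbers agree closely only when sales are rare in both groups.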

It’s important to note that often researchers will report the
odds ratio in lieu of the relative risk because the odds ratio is more
closely associated with the results of logistic regression analysis
and because sometimes you aren’t able to calculate the relative risk;
for example, if you didn’t have complete data for sales rates on all
days other than Wednesday, Saturday, and Sunday. However, relative
risk is more useful in some situations and is often easier to
understand because it deals with probabilities and not odds.
This appendix will show you how to use Excel functions to calculate
the following:

# Euler’s number (e) 

# Powers 

# Natural logarithms 

# Matrix multiplication 

# Matrix inverses 

# Chi-squared statistic from a p-value 

# p-value from a chi-squared statistic 

# F statistic from a p-value 

# p-value from an F statistic 

# Partial regression coefficient of a multiple regression analysis 

# Regression coefficient of a logistic regression equation 

We’ll use a spreadsheet that already includes the data for the 
examples in this appendix. Download the Excel spreadsheet from 
http://www.nostarch.com/regression/. 

EULER'S NUMBER 

Euler’s number ($e$), introduced on page 19, is the base number of
the natural logarithm. This function will allow you to raise Euler’s
number to a power. In this example, we’ll calculate $e^1$.

1. Go to the Euler's Number sheet in the spreadsheet.

2. Select cell B1.

[Screenshot: the spreadsheet with cell B1 selected]

3. Click Formulas in the top menu bar and select Insert Function.

[Screenshot: the Formulas tab of the Excel ribbon, with the Insert Function button at the left of the Function Library group]
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NOTE 


4. From the category drop-down menu, select Math & Trig. Select 
the EXP function and then click OK. 



5. You’ll now see a dialog where you can enter the power to which
you want to raise e. Enter 1 and then click OK.



Because we’ve calculated Euler’s number to the power of 1, 
you’ll just get the value of e (to a few decimal places), but you can 
raise e to any power using the EXP function. 



[Screenshot: cell B1 contains =EXP(1) and displays 2.718282.]

NOTE

You can avoid using the Insert Function menu by entering =EXP(X)
into the cell. For example, entering =EXP(1) will also give you the
value of e. This is the case for any function: after using the Insert
Function menu, simply look at the formula bar for the function you
can enter directly into the cell.
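
If you want to check the spreadsheet result outside Excel, Python's standard `math.exp` plays the same role as EXP. This is only a cross-check sketch; the variable name is illustrative.

```python
import math

# math.exp(x) returns e raised to the power x, like =EXP(x) in Excel.
e_to_the_1 = math.exp(1)
print(round(e_to_the_1, 9))  # 2.718281828
```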


POWERS


This function can be used to raise any number to any power. We’ll 
use the example question from page 14: “What’s 2 cubed?” 

1. Go to the Power sheet in the spreadsheet. 

2. Select cell B1 and type =2^3. Press enter.

In Excel, we use the ^ symbol to mean “to the power of,” so 2^3
is 2³, and the result is 8. Make sure to include the equal sign (=) at
the start or Excel will not calculate the answer for you.
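
For comparison, Python writes the same operation with `**` rather than `^`. A minimal sketch:

```python
# Python uses ** for "to the power of"; Excel's =2^3 becomes 2 ** 3.
two_cubed = 2 ** 3
print(two_cubed)  # 8
```

Note that `^` means something different in Python (bitwise XOR), so it cannot be copied over from an Excel formula unchanged.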


NATURAL LOGARITHMS


The LN function performs a natural log transformation (see
page 20).

1. Go to the Natural Log sheet in the spreadsheet. 

2. Select cell B1. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Math & Trig. Select 
the LN function and then click OK. 


4. Enter exp(3) and click OK.



You should get the natural logarithm of e^3, which, according to
Rule 3 on page 22, will of course be 3. You can enter any number 
here, with a base of e or not, to find its natural log. For example, 
entering exp(2) would produce 2, while entering just 2 would give 
0.6931. 
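
The two results quoted above can be cross-checked with Python's standard `math.log`, which computes the natural logarithm when given a single argument. The variable names are illustrative.

```python
import math

# ln(e^3): the logarithm undoes the exponential, so this is 3.
ln_e_cubed = math.log(math.exp(3))

# ln(2): a plain number that is not a power of e.
ln_2 = math.log(2)

print(round(ln_e_cubed, 4))  # 3.0
print(round(ln_2, 4))        # 0.6931
```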

MATRIX MULTIPLICATION 

The MMULT function multiplies matrices. We’ll calculate the
multiplication example shown in Example Problem 1 on page 41.

1. Go to the Matrix Multiplication sheet in the spreadsheet. 

2. Select cell G1. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Math & Trig. Select 
the MMULT function and then click OK. 
4. Click in the Array1 field and highlight all the cells of the first
matrix in the spreadsheet. Then click in the Array2 field and
highlight the cells containing the second matrix. Click OK.


[Screenshot: Function Arguments dialog for MMULT. Array1 is A1:B2 =
{1,2;3,4}, Array2 is D1:E2 = {4,5;-2,4}, and the previewed result is
{0,13;4,31}.]

5. Starting with G1, highlight a matrix of cells with the same
dimensions as the matrices you are multiplying (G1 to H2
in this example). Then click in the formula bar.



6. Press ctrl-shift-enter. The fields in your matrix should fill with 
the correct values. 


[Screenshot: cells G1:H2 now contain 0 and 13 in the first row and
4 and 31 in the second row.]

You should get the same results as Risa gets at the bottom of
page 41. You can do this with any two matrices whose dimensions are
compatible: the first must have as many columns as the second has rows.
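
To see what MMULT is doing with the two matrices from this example, here is a small pure-Python sketch. The helper `mat_mul` is a hypothetical name written for this illustration, not an Excel or library function.

```python
def mat_mul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

first = [[1, 2], [3, 4]]      # cells A1:B2
second = [[4, 5], [-2, 4]]    # cells D1:E2

product = mat_mul(first, second)
print(product)  # [[0, 13], [4, 31]]
```

Each entry of the product is a row of the first matrix multiplied element by element with a column of the second, then summed.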


MATRIX INVERSION 

The MINVERSE function calculates matrix inverses. We’ll use the
example shown on page 44.

1. Go to the Matrix Inversion sheet in the spreadsheet.

2. Select cell D1. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Math & Trig. Select 
the MINVERSE function and then click OK. 



4. Select and highlight the matrix in the sheet—that’s cells A1 to 
B2—and click OK. 


[Screenshot: Function Arguments dialog for MINVERSE. Array is A1:B2 =
{1,2;3,4}, and the previewed result is {-2,1;1.5,-0.5}.]

5. Starting with D1, select and highlight a matrix of cells with the
same dimensions as the first matrix (in this case, D1 to E2).
Then click in the formula bar.



6. Press ctrl-shift-enter. The fields in your matrix should fill with
the correct values.


[Screenshot: cells D1:E2 now contain -2 and 1 in the first row and
1.5 and -0.5 in the second row.]

You should get the same result as Risa does on page 44. You 
can use this on any matrix you want to invert; just make sure the 
matrix of cells you choose for the results has the same dimensions 
as the matrix you’re inverting. 
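
For a 2 × 2 matrix like the one in this example, the inverse can be written out directly. The sketch below uses a hypothetical helper, `inverse_2x2`, and handles only the 2 × 2 case; MINVERSE itself works on any square matrix.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Invert a 2 x 2 matrix [[a, b], [c, d]] using the determinant formula."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

matrix = [[1, 2], [3, 4]]     # cells A1:B2
inverse = inverse_2x2(matrix)
print([[float(x) for x in row] for row in inverse])  # [[-2.0, 1.0], [1.5, -0.5]]
```

Multiplying the original matrix by this result gives the identity matrix, which is a handy way to confirm an inverse.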


CALCULATING A CHI-SQUARED STATISTIC FROM A P-VALUE


The CHISQ.INV.RT function calculates a test statistic from a
chi-squared distribution, as discussed on page 54. We’ll use a
p-value of .05 and 2 degrees of freedom.

1. Go to the Chi-Squared from p-Value sheet in the spreadsheet. 

2. Select cell B3. Click Formulas in the top menu bar and then 
select Insert Function. 

3. From the category drop-down menu, select Statistical. Select 
the CHISQ.INV.RT function and then click OK. 



4. Click in the Probability field and enter B1 to select the
probability value in that cell. Then click in the Deg_freedom field
and enter B2 to select the degrees of freedom value. When (B1,B2)
appears in cell B3, click OK.
You can check this calculation against Table 1-6 on page 56.

CALCULATING A P-VALUE FROM A CHI-SQUARED STATISTIC

The CHISQ.DIST.RT function is used on page 179 in the likelihood
ratio test to obtain a p-value. We’re using a test statistic value of
10.1 and 2 degrees of freedom.

1. Go to the p-Value from Chi-Squared sheet in the spreadsheet.

2. Select cell B3. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Statistical. Select 
the CHISQ.DIST.RT function and then click OK. 
4. Click in the X field and enter B1 to select the chi-squared value
in that cell. Then click the Deg_freedom field and enter B2 to
select the degrees of freedom value. When (B1,B2) appears in
cell B3, click OK.




We get 0.006409, which on page 179 has been rounded down to 
0.006. 
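
Both chi-squared examples in this appendix use 2 degrees of freedom, and in that special case the right-tail probability has a simple closed form, exp(-x/2). The sketch below relies on that fact and is valid only for 2 degrees of freedom; the function names are hypothetical, and Excel's CHISQ.DIST.RT and CHISQ.INV.RT handle any degrees of freedom.

```python
import math

def chi2_right_tail_df2(x):
    """Right-tail probability of a chi-squared value x, for 2 degrees of freedom only."""
    return math.exp(-x / 2)

def chi2_from_p_df2(p):
    """Chi-squared value whose right-tail probability is p, for 2 degrees of freedom only."""
    return -2 * math.log(p)

critical_value = chi2_from_p_df2(0.05)   # compare with CHISQ.INV.RT(0.05, 2)
p_value = chi2_right_tail_df2(10.1)      # compare with CHISQ.DIST.RT(10.1, 2)

print(round(critical_value, 4))  # 5.9915
print(round(p_value, 6))         # 0.006409
```

The two functions are inverses of each other, which mirrors the relationship between the two Excel functions.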


[Screenshot: cell B3 contains =CHISQ.DIST.RT(B1,B2), with the
chi-squared value 10.1 in B1, 2 degrees of freedom in B2, and the
resulting probability 0.006409 in B3.]

CALCULATING AN F STATISTIC FROM A P-VALUE

The F.INV.RT function gives us the F statistic we calculated on page 58.

1. Go to the F Statistic from p-Value sheet in the spreadsheet. 

2. Select cell B4. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Statistical. Select 
the F.INV.RT function and then click OK. 

4. Click in the Probability field and enter B1 to select the
probability value in that cell. Click in the Deg_freedom1 field and
enter B2 and then select the Deg_freedom2 field and enter B3. When
(B1,B2,B3) appears in cell B4, click OK.


We get 4.747225, which has been rounded down to 4.7 in 
Table 1-7 on page 58. 



[Screenshot: cell B4 contains =F.INV.RT(B1,B2,B3), with the
probability 0.05 in B1, the two degrees of freedom in B2 and B3, and
the resulting F statistic 4.747225 in B4.]
CALCULATING A P-VALUE FROM AN F STATISTIC

The F.DIST.RT function is used on page 90 to calculate the p-value
in an ANOVA.

1. Go to the p-Value for F Statistic sheet in the spreadsheet. 

2. Select cell B4. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Statistical. Select 
the F.DIST.RT function and then click OK. 



4. Click in the X field and enter B1 to select the F value in that
cell. Click in the Deg_freedom1 field and enter B2, and then
click in the Deg_freedom2 field and enter B3. When (B1,B2,B3)
appears in cell B4, click OK.
The result, 7.66775E-06, is the way Excel presents the value
7.66775 × 10⁻⁶. If we were testing at the p = .05 level, this would be a
significant result because it is less than .05.
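
Python's standard library has no F distribution, so the sketch below leans on a special case: with 1 numerator degree of freedom, an F statistic is the square of a t statistic, and for an even number of denominator degrees of freedom the t distribution has a finite series form. It therefore covers only cases like the one here (1 and 12 degrees of freedom). The function names are hypothetical, and the inverse is found by simple bisection rather than by whatever method Excel uses internally.

```python
import math

def f_right_tail_df1_is_1(f, df2):
    """Right-tail probability of F(1, df2); only valid for even df2 >= 2."""
    if df2 < 2 or df2 % 2 != 0:
        raise ValueError("this sketch only handles even df2 >= 2")
    # With t = sqrt(f), let theta = atan(t / sqrt(df2)).
    theta = math.atan(math.sqrt(f / df2))
    cos2 = math.cos(theta) ** 2
    term, series = 1.0, 1.0
    # Series: 1 + (1/2)cos^2 + (1*3)/(2*4)cos^4 + ... up to cos^(df2 - 2).
    for k in range(1, df2 // 2):
        term *= (2 * k - 1) / (2 * k) * cos2
        series += term
    return 1.0 - math.sin(theta) * series

def f_from_p_df1_is_1(p, df2, lo=0.0, hi=1e6):
    """F value with right-tail probability p, found by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if f_right_tail_df1_is_1(mid, df2) > p:
            lo = mid   # tail still too large, so F must be bigger
        else:
            hi = mid
    return (lo + hi) / 2

p_value = f_right_tail_df1_is_1(55.6, 12)   # compare with F.DIST.RT(55.6, 1, 12)
f_critical = f_from_p_df1_is_1(0.05, 12)    # compare with F.INV.RT(0.05, 1, 12)
print(p_value, f_critical)
```

The printed values should agree closely with the 7.66775E-06 and 4.747225 that Excel reports for these inputs.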



[Screenshot: cell B4 contains =F.DIST.RT(B1,B2,B3), with F = 55.6 in
B1, degrees of freedom 1 and 12 in B2 and B3, and the resulting
probability 7.66775E-06 in B4.]

PARTIAL REGRESSION COEFFICIENT OF A
MULTIPLE REGRESSION ANALYSIS


The LINEST function calculates the partial regression coefficients for
the data on page 113, giving the results that Risa gets on page 118.

1. Go to the Partial Regression Coefficient sheet in the 
spreadsheet. 

2. Select cell G2. Click Formulas in the top menu bar and select
Insert Function.

3. From the category drop-down menu, select Statistical. Select 
the LINEST function and then click OK. 




4. Click in the Known_y’s field and highlight the data cells for your
outcome variable (here, D2 to D11). Click in the Known_x’s
field and highlight the data cells for your predictor variables
(here, B2 to C11). You don’t need any values for Const and Stats,
so click OK.




5. The full function gives you three values, so highlight G2 to I2
and click the formula bar. Press ctrl-shift-enter, and the
highlighted fields should fill with the correct values.


[Screenshot: the sheet after entering {=LINEST(D2:D11,B2:C11)}.]

Shop                       Floor space   Distance to nearest   Monthly
                           (tsubo)       station (meters)      sales
Yumenooka Shop             10            80                    469
Terai Station Shop         8             0                     366
Sone Shop                  8             200                   371
Hashimoto Station Shop     5             200                   208
Kikyou Town Shop           7             300                   246
Post Office Shop           8             230                   297
Suidobashi Station Shop    7             40                    363
Rokujo Station Shop        9             0                     436
Wakaba Riverside Shop      6             330                   198
Misato Shop                9             180                   364

Partial regression coefficients returned by LINEST:

Distance to nearest station (meters)    -0.3409
Floor space (tsubo)                     41.5135
Constant term                           65.32391

You can see that the results are the same as Risa’s results on 
page 118 (in the text, they have been rounded). 
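
To see where these numbers come from, the sketch below fits the same data by ordinary least squares in plain Python, solving the normal equations (XᵀX)b = Xᵀy. This is one standard way to compute a least squares fit and is not necessarily how LINEST is implemented. The helper `solve_linear_system` and the variable names are hypothetical; the data are the ten shops from the sheet.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

floor_space = [10, 8, 8, 5, 7, 8, 7, 9, 6, 9]                  # tsubo
distance = [80, 0, 200, 200, 300, 230, 40, 0, 330, 180]        # meters
sales = [469, 366, 371, 208, 246, 297, 363, 436, 198, 364]     # monthly sales

# Design matrix with a leading column of ones for the constant term.
rows = [[1.0, x1, x2] for x1, x2 in zip(floor_space, distance)]

# Normal equations: (X^T X) b = X^T y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, sales)) for i in range(3)]

constant, b_floor, b_distance = solve_linear_system(xtx, xty)
print(round(b_distance, 4), round(b_floor, 4), round(constant, 4))
```

The three printed values should match the LINEST output above (about -0.3409, 41.5135, and 65.3239). Note that LINEST lists the coefficients in reverse order of the predictor columns, with the constant term last.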


REGRESSION COEFFICIENT OF A
LOGISTIC REGRESSION EQUATION


There is no Excel function that calculates the logistic regression
coefficient, but you can use Excel’s Solver tool. This example
calculates the maximum likelihood coefficients for the logistic
regression equation using the data on page 166.

1. Go to the Logistic Regression Coefficient sheet in the
spreadsheet.

2. First you’ll need to check whether Excel has Solver loaded.
When you select Data in the top menu bar, you should see a
button to the far right named Solver. If it is there, skip ahead
to Step 4; otherwise, continue on to Step 3.



3. If the Solver button isn’t there, go to File ► Options ► Add-Ins 
and select the Solver Add-in. Click Go, select Solver Add-in 
in the Add-Ins dialog, and then click OK. Now when you select 
Data in the top menu bar, the Solver button should be there. 




4. Click the Solver button. Click in the Set Objective field and 
select cell L3 to select the log likelihood data. Click in the By 
Changing Variable Cells field and select the cells where you 
want your results to appear—in this case L5 to L7. Click Solve. 
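
The idea behind this Solver step is to adjust the coefficients until the log likelihood is as large as possible. The sketch below does the same thing for a logistic regression with one predictor, using Newton's method in place of Solver's own optimizer (a swapped-in technique, not a description of Solver). The data are hypothetical toy values invented for illustration, NOT the data from page 166, and all function names are assumptions.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log likelihood of the model p = 1 / (1 + exp(-(b0 + b1 * x)))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, iterations=25):
    """Find the coefficients that maximize the log likelihood (Newton's method)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of the log likelihood with respect to b0 and b1.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian (a 2 x 2 matrix), built from the weights p(1 - p).
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system by Cramer's rule.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: x is a predictor, y is 1 (sold) or 0 (not sold).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1, log_likelihood(b0, b1, xs, ys))
```

At the solution the gradient is essentially zero, which is the same stopping condition an optimizer like Solver is aiming for when it maximizes the objective cell.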
SYMBOLS

Δ (delta), 29
' (prime), 32

A 

accuracy. See also coefficient
of determination
of logistic regression 
analysis equation, 
173-177 

of multiple regression 
equation, 119-126 
adding matrices, 39-40 
adjusted odds ratio, 
192-194 

adjusted R², 124-126
alternative hypothesis
(Hₐ), 48

analysis of variance 

(ANOVA). See also 
hypothesis testing 
logistic regression 

analysis, 178-181 
multiple regression 

analysis, 128-132 
regression analysis, 
87-90 

apparent error rate, 177 
assumptions of normality, 
85-86 

autocorrelation, checking 
for, 102-103 
average, 72 


B 

bell curves, 53-54 
best subsets regression, 
139-140 

binomial logistic regression 
analysis. See logistic 
regression analysis 


C

calculus, differential. See
differential calculus 
Canceling Exponentials 
Rule, 22 

categorical data, 46 
converting numerical 
data, 46-47 
in logistic regression 
analysis, 167 
in multiple regression 
analysis, 147-149 
chi-squared (χ²) distributions,
54-55, 56, 204-206

coefficient of determination
(R²)
adjusted, 124-126 
logistic regression 

analysis, 173-177 
multiple regression 

analysis, 119-126 
regression analysis, 
81-82 


coefficients. See specific 
coefficients by name 
columns, in matrices, 38 
concentration matrix, 145 
confidence coefficient, 
92-93 

confidence intervals, 
calculating 
multiple regression 
analysis, 

133-135, 146 
for odds ratio, 194-195 
regression analysis, 
91-94 

correlation coefficient
(R), 70
general discussion, 
64-65 

multiple regression 
analysis, 120 
regression analysis, 
78-82 

critical value, 55 

D

data. See also 

categorical data 
plotting, 64-65 
types of, 46-47 
degrees of freedom, 50-51 
delta (Δ), 29

dependent variables, 14, 

67, 149-152. See also 
scatter plots 



differential calculus 
differentiating, 31-36 
general discussion, 

24-30 

Durbin-Watson statistic, 
102-103 

E

elements, in matrices, 38 
Euler’s number, 19, 198-199 
event space, 53 
Excel functions, 198 
Exponentiation Rule, 22 
exponents, 19-23, 200 
extrapolation, 102 


F

F distributions, 57-59,
206-209 

freedom, degrees of, 50-51 
F-test, 129-133 
functions. See also 

probability density 
functions 
exponential, 19-23 
inverse, 14-18 
likelihood, 161-163, 171 
logarithmic, 19-23 
log-likelihood, 161-163, 
171-172 

natural logarithm, 20, 
200-201 

G

graphs. See also 
scatter plots 
for inverse functions, 
17-18 

logistic regression analysis
equation, 159

H 

H₀ (null hypothesis), 48
Hₐ (alternative
hypothesis), 48


hypothesis testing, 85-90 
logistic regression 

analysis, 178-181 
multiple regression 

analysis, 128-132 
with odds, 194 

I 

identity matrices, 44 
independent variables, 

14, 67 

choosing best combination
of, 138-140
determining influence on 
outcome variables, 
149-152 

logistic regression 

analysis, 164-167 
multicollinearity, 149 
structural equation 
modeling, 152 
interpolation, 102 
inverse functions, 14-18 
inverse matrices, 44, 
202-204 

L 

likelihood function, 
161-163, 171 
likelihood ratio test, 179 
linear equations, turning 
nonlinear equations 
into, 104-106 
linear least squares 

regression, 71-76, 115 
linear regression analysis, 7 
linearly independent 
data, 47 

logarithms, 19-23 
logistic regression analysis, 
8, 157 

accuracy of equation, 

assessing, 173-177 
adjusted odds ratio, 
192-194 


confidence intervals 
for odds ratios, 
194-195 

equation for, calculating, 
158-159, 170-173 
hypothesis testing, 
178-181, 194 
maximum likelihood 
method, 159-163, 
210-212 

odds ratio, 192-194 
predicting with, 182 
predictor variables, 

choosing, 164-167 
procedure, general discussion
of, 168, 190
relative risk, 195-196 
scatter plot, drawing, 169 
logit, 190-191 
log-likelihood function, 
161-163, 171-172 


M

Mahalanobis distance, 133,
137, 144-146 
matrices 

adding, 39-40 
general discussion, 
37-38 
identity, 44 
inverse, 44, 202-204 
multiplying, 40-43, 
201-202 

prediction intervals,
calculating, 144-146
maximum likelihood
estimate, 162-163
mean, 49 
median, 49 
multicollinearity, 149 
multiple correlation 
coefficient 
accuracy of multiple 
regression equation,
119-121
adjusted, 124-126 
problems with, 122-123 
multiple regression

analysis, 7-8, 111 
accuracy of equation, 

assessing, 119-126 
analysis of variance, 
128-132 

categorical data, using 
in, 147-149 
confidence intervals, 
calculating, 

133-135 

equation for, calculating, 
115-119 

hypothesis testing, 127 
Mahalanobis distance, 
144-146 

multicollinearity, 149 
prediction intervals, 
calculating, 

136-137 

predictor variables, 

choosing, 138-140 
predictor variables, 

determining influence
on outcome
variables, 149-152
procedure, general discussion
of, 112, 142
scatter plot, drawing, 
113-114 

standardized residuals, 
143-144 

multiplying matrices, 

40-43, 201-202 

N 

natural logarithm function, 
20, 200-201 
nonlinear regression, 
103-106 

normal distributions, 53-54 
normality, assumptions of, 
85-86 

null hypothesis (H₀), 48
numerical data, 46-47 


O

odds, 190 

hypothesis testing, 194 
logit, 190-191 
odds ratio (OR) 

adjusted, 192-194 
confidence intervals, 
calculating, 

194-195 

general discussion, 
191-192 

outcome variables, 14, 67, 
149-152 

outliers, 101, 144 
overfitting, 149 

P 

partial regression 
coefficients 
calculating with Excel, 
209-210 

general discussion, 
116-118 

hypothesis testing, 127, 
129-131 

Pearson product moment 
correlation coefficient, 
79. See also correlation
coefficient
plotting data, 64-65 
population mean, 91 
population regression, 86 
populations 

assessing, 82-84 
confidence intervals, 
calculating, 

133-135 
Power Rule, 21 
predictions 

logistic regression 
analysis, 182 
multiple regression 

analysis, 136-137 
regression analysis, 
95-98 


predictor variables, 14, 67 
choosing best combination
of, 138-140
determining influence on 
outcome variables, 
149-152 

logistic regression 

analysis, 164-167 
multicollinearity, 149 
structural equation 
modeling, 152 
prime ('), 32 
probability density 
functions 

chi-squared distribution, 
54-55, 56, 204-206 
F distributions, 57-59, 
206-209 

general discussion,
52-53
normal distribution,
53-54

tables, 55-56 
Product Rule, 23 
pseudo-R², 173-177

Q 

qualitative data, 46 
quantitative data, 46 
Quotient Rule, 21 

R 

R (correlation coefficient),
70

general discussion, 

64-65 

multiple regression 
analysis, 120 
regression analysis, 
78-82 

R² (coefficient of
determination)
adjusted, 124-126 
logistic regression 

analysis, 173-177 
multiple regression 

analysis, 119-126 
regression analysis, 
81-82 

regression analysis 
analysis of variance, 
87-90 

assumptions of 

normality, 85-86 
autocorrelation, checking
for, 102-103
confidence intervals, 

calculating, 91-94 
correlation coefficient, 
calculating, 78-82 
equation, calculating, 
71-77 

equation, general 

discussion, 66-67 
interpolation and 

extrapolation, 102 
nonlinear regression,
103-104

prediction intervals, 

calculating, 95-98 
procedure, general discussion
of, 68, 100
samples and populations,
82-84
scatter plot, drawing, 
69-70 

standardized residual, 
100-101 

regression diagnostics, 
119-121 

regression equation 
calculating, 71-77 
general discussion, 

66-67 

linear equations, turning
nonlinear into,
104-106


relative risk (RR), 195-196 
residual sum of squares, 
73-74 

residuals, 71 

standardized, 100-101, 
143-144 

round-robin method, 
139-140 

rows, in matrices, 38 
RR (relative risk), 195-196 

S

sample regression, 86 
sample variance, 
unbiased, 50 
samples, 82-84 
scatter plots 

differential calculus, 26 
for logistic regression 
analysis, 169 
for multiple regression 
analysis, 113-114 
plotting data, 64-65 
for regression analysis, 
69-70 

SEM (structural equation 
modeling), 152 
squared deviations, 
sum of, 50 

standard deviation, 51-52 
standardized residuals, 
100-101, 143-144 
statistically significant, 58 
statistics 

data types, 46-47 
hypothesis testing, 48 
variation, measuring, 
49-52 

structural equation 

modeling (SEM), 152 
subsets regression, best, 
139-140 
sum of squared 

deviations, 50 


T 

testing hypotheses, 85-90 
logistic regression 

analysis, 178-181 
multiple regression 

analysis, 128-132 
with odds, 194 
tolerance, 149 

U

unbiased sample 
variance, 50 

V 

variables. See dependent 

variables; independent 
variables; scatter plots 
variance, 50-51 
variance, analysis of 
logistic regression 

analysis, 178-180 
multiple regression 

analysis, 128-132 
regression analysis, 
87-90 

variance inflation factor 
(VIF), 149 

W

Wald test, 180 

X

x-bar, 72 

Y

y-hat, 73 